29 Nov 2015
I never know whether to prefer closure or fresh questions, in research. I got a little of both lately. First, closure on searching in permutations of sorted arrays, and second, more avenues on the compilation as compression angle.
Array Layouts for Comparison-Based Searching
In July 2012, I started really looking into searching in static sorted sets, and found the literature disturbingly useless. I reviewed a lot of code, and it turned out that most binary searches out there are not only unsafe with respect to overflow, but also badly micro-optimised for small arrays with simple comparators. That led to Binary Search eliminates Branch Mispredictions, a reaction to popular assertions that binary search has bad constant factors (compared to linear search or a breadth-first layout) on modern microarchitectures, mostly due to branch mispredictions. That post has code for really riced-up searches on fixed array sizes, so here's the size-generic inner loop I currently use.
[code snippet omitted: the size-generic binary search inner loop]
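For concreteness, here's a sketch of a size-generic, branch-free binary search inner loop in that spirit (my own reconstruction, not necessarily the post's exact code; the name lower_bound is mine). The conditional expression maps to a conditional move, so the only branch left is the well-predicted loop condition:

```c
#include <stddef.h>

/* Sketch of a branch-free lower-bound search: returns a pointer to the
 * first element >= key, or base + n if every element is smaller.
 * Assumes n > 0 and a sorted array. */
static const int *lower_bound(const int *base, size_t n, int key)
{
        while (n > 1) {
                const size_t half = n / 2;
                /* Conditional move instead of a taken/not-taken branch. */
                base = (base[half - 1] < key) ? base + half : base;
                n -= half;
        }
        return (*base < key) ? base + 1 : base;
}
```

With optimisation on, GCC and Clang typically compile the ternary into a cmov, and the trip count depends only on n, never on the data, so there is nothing for the branch predictor to get wrong.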
The snippet above implements a plain binary search, rather than dividing in three to avoid aliasing issues. That issue only shows up with array sizes that are (near) powers of two. I know of two situations where that happens a lot:
- bad benchmarks;
- data structures built around powers of two (e.g., Saxe & Bentley dynamisation).
The fix for the first case is to do proper benchmarking on a wide range of input sizes. Ternary or offset binary search are only really useful in the second case. There's actually a third case: when I'm about to repeatedly search in the same array, I dispatch to unrolled ternary searches, with one routine for each power of two. I can reduce any size to a power of two with one initial iteration on an off-center "midpoint." Ternary search has a high overhead for small arrays, unless we can precompute offsets by unrolling the whole thing.
My work on binary search taught me how to implement binary search not stupidly (unlike real implementations), and that most experiments on searching in array permutations seem broken in their very design (they focus on full binary trees).
I don't think I ever made that explicit, but the reason I even started looking into binary search is that I wanted to have a fast implementation of searching in a van Emde Boas layout! However, none of the benchmarks (or analyses) I found were convincing, and I kind of lost steam as I improved sorted arrays: sortedness tends to be useful for operations other than predecessor/successor search.
Some time in May this year, I found Pat Morin's fresh effort on the exact question I had abandoned over the years: how do popular permutations work in terms of raw CPU time? The code was open, and even good by research standards! Pat had written the annoying part (building the permutations), generated a bunch of tests I could use to check correctness, and avoided obvious microbenchmarking pitfalls. He also found a really nice way to find the return value for BFS searches from the location where the search ends with fast bit operations (j = (i + 1) >> __builtin_ffs(~(i + 1)), which he explains in the paper).
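In context, that trick drops into a BFS (Eytzinger) search loop along these lines (a hedged sketch under common conventions, not Pat's actual code: 0-based indexing, children of a[i] at 2i+1 and 2i+2, lower-bound semantics):

```c
#include <stddef.h>

/* Search an array stored in breadth-first (Eytzinger) order: the implicit
 * tree's root is a[0] and the children of a[i] are a[2i+1] and a[2i+2].
 * The loop always runs down to a (virtual) leaf; the ffs trick then peels
 * the trailing one bits off i+1 to recover the last left turn, which is
 * where the answer lives.  Returns the index of the first element >= x,
 * or n when there is none. */
static size_t eytzinger_search(const int *a, size_t n, int x)
{
        size_t i = 0;
        while (i < n)
                i = (x <= a[i]) ? 2 * i + 1 : 2 * i + 2;
        size_t j = (i + 1) >> __builtin_ffsll((long long)~(i + 1));
        return (j == 0) ? n : j - 1;
}
```

Real versions also prefetch a few levels ahead of the probe (something like __builtin_prefetch(a + 16 * i) at the top of the loop); the sketch omits that to keep the control flow obvious.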
I took that opportunity to improve constant factors for all the implementations, and to really try and explain in detail the performance of each layout with respect to each other, as well as how they respond to the size of the array. That sparked a very interesting back and forth with Pat from May until September (!). Pat eventually took the time to turn our informal exchange into a coherent paper. More than 3 years after I started spending time on the question of array layouts for implicit search trees, I found the research I was looking for… all it took was a bit of collaboration (:
Bonus: the results were unexpected! Neither of the usual suspects (B-tree or van Emde Boas) came out on top, even for very large arrays. I was also surprised to see the breadth-first layout perform much better than straight binary search: none of the usual explanations made sense to me. It turns out that the improved performance (when people weren't testing on round, power-of-two array sizes) was probably an unintended consequence of bad code! Breadth-first is fast, faster than layouts with better cache efficiency, because it prefetches well enough to hide latency even when it extracts only one bit of information from each cache line; its performance has nothing to do with cacheability. Our code prefetches explicitly, but slower branchy implementations in the wild get implicit prefetching, thanks to speculative execution.
Conclusion: if you need to do a lot of comparison-based searches in > L2-sized arrays, use a breadth-first order and prefetch. If you need sorted arrays, consider sticking some prefetches in a decent binary search loop. If only I'd known that in 2012!
A couple months ago, I found LZ77-like Compression with Fast Random Access by Kreft and Navarro. They describe a Lempel-Ziv approach that is similar to LZ77, but better suited to decompressing arbitrary substrings. The hard part about applying LZ77 compression to (byte)code is that parses may reuse any substring that happens to appear earlier in the original text. That's why I had to use Jez's algorithm to convert the LZ77 parse into a (one word) grammar.
LZ-End fixes that.
Kreft and Navarro improve random access decompression by restricting the format of "backreferences" in the LZ parse. The parse decomposes the original string into a sequence of phrases; concatenating the phrases yields the string back, and phrases have a compact representation. In LZ77, phrases are compressed because they refer back to substrings in prior phrases. LZ-End adds another constraint: the backreferences cannot end in the middle of phrases.
For example, LZ77 might have a backreference like [figure omitted] to represent "cdefg." LZ-End would be forced to end the new phrase at "f" and only represent "cdef." The paper shows that this additional restriction has a marginal impact on compression rate, and uses the structure to speed up operations on compressed data. (The formal definition also forbids the cute/annoying self-reference-as-loop idiom of LZ77, without losing too much compression power!)
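To make the constraint concrete, here's a toy brute-force sketch of the match rule (my own illustration, not the paper's algorithm; the text, phrase boundaries, and the name lzend_match are all made up): a new phrase may copy text[e-l..e) only if some earlier phrase of the parse ends exactly at position e.

```c
#include <stddef.h>
#include <string.h>

/* Longest match for the suffix starting at pos whose source ends exactly
 * at a phrase boundary: is_end[e] is nonzero iff some phrase of the parse
 * built so far ends just before index e.  O(n^3) brute force, toy only.
 * Note the source alignment is anchored at its END (e), so a longer l can
 * succeed where a shorter one fails; we must try every length. */
static size_t lzend_match(const char *text, size_t n, size_t pos,
                          const unsigned char *is_end)
{
        size_t best = 0;
        for (size_t e = 1; e <= pos; e++) {
                if (!is_end[e])
                        continue;
                size_t max_l = (e < n - pos) ? e : n - pos;
                for (size_t l = best + 1; l <= max_l; l++)
                        if (memcmp(text + e - l, text + pos, l) == 0)
                                best = l;
        }
        return best;
}
```

With text "acdefgacdefg" and earlier phrases "a", "cdef", "ga" (ends before indices 1, 5, 7), the suffix "cdefg" at position 7 can only be matched up to "cdef": plain LZ77 would copy all five characters, but no phrase ends right after the earlier "g", so LZ-End must stop at the boundary after "f".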
We can apply the same idea to compress code. Each phrase is now a subroutine with a return at the end. A backreference is a series of calls to subroutines; the first call might begin in the middle, but matches always end on return, exactly like normal code does! A phrase might begin in the middle of a phrase that itself consists of calls. That's still implementable: the referrer can see through the indirection and call in the middle of the callee's callee (etc.), and then go back to the callee for a suitably aligned submatch.
That last step looks like it causes a space blowup, and I can't bound it (yet).
But that's OK, because I was only looking into compressing traces as a last resort. I'm much more interested in expression trees, but couldn't find a way to canonicalize sets (e.g., arguments to integer +) and sequences (e.g., floating point *) so that similar collections have similar subtrees… until I read Hammer et al's work on Nominal Adapton, which solves a similar problem in a different context.
They want a tree representation for lists and tries (sets/maps) such that a small change in the list/trie causes a small change in the tree that mostly preserves identical subtrees. They also want the representation to be a deterministic function of the list/trie. That way, they can efficiently reuse computations after incremental changes to inputs.
That's exactly my sequence/set problem! I want a tree-based representation for sequences (lists) and sets (tries) such that similar sequences and sets have mostly identical subtrees for which I can reuse pre-generated code.
Nominal Adapton uses a hash-based construction described by Pugh and Teitelbaum in 1989 (Incremental computation via function caching) to represent lists, and extends the idea for tries. I can "just" use the same trick to canonicalise lists and sets into binary trees, and (probabilistically) get common subexpressions for free, even across expressions trees! It's not perfect, but it should scale pretty well.
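Here's a hedged sketch of the flavour of that trick for sets (my own illustration, using a deterministic treap rather than the Pugh-Teitelbaum or Adapton constructions): derive each node's heap priority from a hash of its key, and the tree shape becomes a pure function of the set's contents. Any insertion order canonicalises the same set to the same tree, and similar sets tend to share whole subtrees.

```c
#include <stdint.h>
#include <stdlib.h>

/* Canonical set representation: a treap whose priorities are hashes of
 * the keys.  Because priorities are (with overwhelming probability)
 * distinct, the treap shape is the unique one satisfying both the BST
 * order on keys and the heap order on priorities, so it is independent
 * of insertion order. */
typedef struct node {
        uint64_t key, prio;
        struct node *left, *right;
} node;

/* splitmix64 finalizer as a cheap, well-mixed hash. */
static uint64_t hash64(uint64_t x)
{
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
}

static node *insert(node *t, uint64_t key)
{
        if (t == NULL) {
                node *n = calloc(1, sizeof *n);
                n->key = key;
                n->prio = hash64(key);
                return n;
        }
        if (key == t->key)
                return t;
        if (key < t->key) {
                t->left = insert(t->left, key);
                if (t->left->prio > t->prio) {  /* rotate right */
                        node *l = t->left;
                        t->left = l->right;
                        l->right = t;
                        t = l;
                }
        } else {
                t->right = insert(t->right, key);
                if (t->right->prio > t->prio) { /* rotate left */
                        node *r = t->right;
                        t->right = r->left;
                        r->left = t;
                        t = r;
                }
        }
        return t;
}

/* Structural equality: canonical trees for equal sets are identical. */
static int same_tree(const node *a, const node *b)
{
        if (a == NULL || b == NULL)
                return a == b;
        return a->key == b->key &&
               same_tree(a->left, b->left) &&
               same_tree(a->right, b->right);
}
```

For sequences, the analogous move is to pick tree split points from element hashes so that a list's shape is also history-independent; as with the treap, hash collisions make the sharing guarantee probabilistic rather than absolute.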
That's what I'm currently exploring when it comes to using compression to reduce cache footprint while doing aggressive specialisation. Instead of finding redundancy in linearised bytecode after the fact, induce identical subtrees for similar expressions, and directly reuse code fragments for subexpressions.
More to come
I thought I'd post a snippet on the effect of alignment and virtual memory tricks on TLBs, but couldn't find time for that. Perhaps later this week. In the meantime, I have to prepare a short talk on the software transactional memory system we built at AppNexus. Swing by 23rd Street on December 15 if you're in New York!
29 Nov 2015 8:25am GMT
24 Nov 2015
At its core, Common Lisp provides two primitives for performing iteration. The first of those primitives is recursion. Recursion is an amazing technique, but in this post I am going to focus on the other primitive - goto.
Goto is extremely powerful. It lets you manipulate the control flow of your program in any way you can think of. This freedom to do whatever you want is also what makes goto so dangerous. In any given piece of code that uses goto, it is difficult to tell what the purpose of the goto is, because it could be used for so many different reasons. Because of this, most languages provide various kinds of builtin loops instead of providing raw goto. Even though loops aren't as general as goto, they express the intention of the code much more clearly.
As an example, let's say you want to print all of the characters in a file. If your language provided while loops, you could do this by printing characters from the file one at a time while there are more characters left. If Common Lisp had while loops, the code for this procedure would look like this:
(while (peek-char file nil nil) (write-char (read-char file)))
If your language only had goto, it becomes much more difficult to implement the procedure. In the end, you have to, in some way, simulate a while loop. One way to code the procedure with just goto is the following. First check if there are any characters left in the file. If there aren't any, goto the end. Otherwise print the next character and go back to the start. Here is Common Lisp code that implements this:
(tagbody
 start
   (if (not (peek-char file nil nil))
       (go end))
   (write-char (read-char file))
   (go start)
 end)
Not only is the version with goto much more verbose, it is also much harder to understand. The code lacks clarity because goto is so general. It gives you no context into how it is being used. The reader of the code will have to think about the positioning of all of the gotos before they can think about the overall flow of the program. On the other hand, in the version with the while loop, merely the fact that a while loop is being used gives whoever is reading the code a decent idea of the control flow.
In reality all loops are eventually compiled down to gotos. Whenever the compiler for a language that provides loops sees a loop, it generates code that simulates the loop through goto. You can do the same thing with Lisp macros!
If you don't know, Lisp macros are compile-time functions which take code as their input and return code as their output. When Lisp code is being compiled, all of the macros in the code are called and each one is replaced with its result. This means you can write a macro that looks like a while loop when you use it, but at compile time generates code to simulate a while loop through goto. You are in effect adding while loops to the Lisp compiler! Here is code that defines such a macro:
(defmacro while (test &body body)
  (let ((gtop (gensym))
        (gend (gensym)))
    `(tagbody
       ,gtop
       (if (not ,test) (go ,gend))
       ,@body
       (go ,gtop)
       ,gend)))
With this macro, the first code example is now valid lisp code! The while macro takes as arguments a test and a body. It then generates code that uses the method used in the second example to simulate a while loop with goto. You can actually see what the first example looks like after expanding the macro by using the function macroexpand. Here is what the generated code looks like:
(tagbody
 #:g729
   (if (not (peek-char file nil nil))
       (go #:g730))
   (write-char (read-char file))
   (go #:g729)
 #:g730)
The generated code is exactly the same as the code in the second example, except for the names of the labels. This means the two examples are functionally identical! The only real difference between them is that the first one is expressed in terms of loops, and the second one is expressed in terms of goto. Since it is so much easier to think in terms of loops than goto, there is no reason why you wouldn't use the first example over the second.
Macros allow you to build any feature you want as long as it is possible to simulate that feature through lower level features. With respect to goto, this means you can build any kind of control flow construct you want by simulating it with goto and then putting a macro on top. In Common Lisp, all of the looping constructs (do, do*, dotimes, dolist, loop) are really just macros that expand into goto. This is what Alan Kay meant when he said "Lisp isn't a language, it's a building material". It bears repeating. In Lisp, you can build any feature you want as long as it is possible to simulate that feature in terms of lower level features.
24 Nov 2015 4:00pm GMT
23 Nov 2015
What makes any language suitable for scientific computing?
The most important goal is translating ideas into fast code and building on other people's work. We are working on improving Clasp's ability to generate fast code based on the excellent LLVM library, and Clasp can expose C++/C/Fortran libraries to build on the work of others. I've programmed in many languages, and Common Lisp is the best at expressing ideas. Every language gets translated into an abstract syntax tree on its way to native code; Lisp code is an abstract syntax tree. There is no programming concept that can't be expressed compactly in Lisp; this is not true in other languages. You cannot yet express multiple dispatch functions, dynamic variables, or Common Lisp style macros (a few CL features) compactly in R, Fortran, C++, or Python.
Why are R, Fortran, C++, or Python considered suitable for scientific computing?
It is the wrong question - those languages are just the languages that people started using and so they keep using them. It is not a choice most people make - it is inertia.
I choose Common Lisp.
23 Nov 2015 4:58am GMT
20 Nov 2015
Announcing Clasp 0.4 - a new release that incorporates a brand new compiler capable of generating code 200x faster than previously, many bug fixes, a more complete Common Lisp implementation, and C++ interoperation.
- Clasp has a completely new, optimizing/inlining compiler called cclasp!
- Fixnum, character and single-float types are immediate values.
- General object pointers and cons pointers are tagged for speed.
- Clbind library allows programmers to expose external C++ libraries.
- Lots of bug fixes and stability improvements.
What is Clasp?
Clasp is a new Common Lisp implementation that seamlessly interoperates with C++ libraries, using LLVM for compilation to native code. This allows Clasp to take advantage of a vast array of preexisting libraries and programs, such as those from the scientific computing ecosystem. Embedding them in a Common Lisp environment allows you to make use of rapid prototyping, incremental development, and other capabilities that make Common Lisp such a powerful language.
Precompiled and prepackaged versions of Clasp will be available for a limited number of distributions. Check the releases to see if there is something available for you.
At the moment, Clasp is supported on Linux and Mac OS X. On these systems, you should be able to build it from source if a pre-made package is not available or workable for you. In case you cannot get it to compile even with the instructions below, the quickest way to get help is to either file an issue, or to chat with us directly.
Building on most systems will take around 4GB of RAM and 2-4 hours with a relatively modern processor.
Acknowledgements (#clasp IRC channel nicknames)
Robert Strandh (beach) - the Cleavir compiler and guidance.
Shinmera - build testing and organization.
stassats - guidance, bug finding and speeding up the compiler.
flash- - feedback and debugging of the clbind library.
SAL9000 - designing list iterator and feedback on ASTMatcher library.
faheem - guidance in setting up the build system.
20 Nov 2015 9:44pm GMT
14 Nov 2015
Sometimes I want to create a directory pathname relative to some existing file pathname. For example, when I want to refer to files that are in the same directory as the currently loading file, I might work relative to *load-truename*.
I used to do it like this:
(make-pathname :directory (pathname-directory *load-truename*))
This worked fine for me, but after getting bug reports from Windows users, I've changed it to this:
(make-pathname :defaults *load-truename* :type nil :name nil :version nil)
That way all the properties of *load-truename* are carried over, except the name and type. (The :version nil is per Fare's comment.)
14 Nov 2015 7:08pm GMT
10 Nov 2015
In my last post I talked about memoization, i.e. caching the results of a function. Memoization is a fairly common technique for optimization. It is common enough to warrant writing a macro that makes it easy to define memoized functions. When demonstrating memoization, I had a memoized Fibonacci function that looked like this:
(let ((table (make-hash-table)))
  (defun fib (n)
    (or (gethash n table)
        (setf (gethash n table)
              (if (<= 0 n 1)
                  n
                  (+ (fib (- n 1)) (fib (- n 2))))))))
There are a couple of problems with the above code. One problem is the boilerplate. If you wanted ten different memoized functions, you would have to copy lines 1, 3, and 4 for every single memoized function. Some people like to call programmers who do this kind of needless duplication "human compilers", since they are writing code that the compiler should be writing for them.
Another issue with the above code is the lack of abstraction. If you wanted to change the caching mechanism to say, only cache the last hundred values, you would have to change the definition of every single function! Ideally you would only need to modify the code in one place in order to change how the caching is implemented.
Defmemo is one way to solve both of these problems. Here is what the above code would look like if it were to use defmemo:
(defmemo fib (n)
  (if (<= 0 n 1)
      n
      (+ (fib (- n 1)) (fib (- n 2)))))
Defmemo solves both of the problems extremely well. It removes all of the differences between the memoized version and the regular version except for having to use "defmemo" instead of "defun". Defmemo also solves the abstraction problem by moving all of the code relevant to memoization into the body of defmemo. If you want to change how memoization works, all you have to do is change the code for defmemo.
Now for the implementation of defmemo. The implementation is made up of two separate parts. First, a higher order function, memo, which takes a function as an argument, and returns a memoized version of that function. The second part is the actual macro, defmemo. Instead of just defining the function like defun, defmemo first builds a lambda expression for the body. Then it generates code that calls memo on that lambda function. Finally, defmemo uses the result of memo as the implementation of the function being defined.
(defun memo (f)
  (let ((cache (make-hash-table :test #'equalp)))
    (lambda (&rest args)
      (or (gethash args cache)
          (setf (gethash args cache)
                (apply f args))))))
Memo works by returning a function that has an internal hash-table. When that function is called, it first checks its hash-table to see if it has been called with the same arguments before. If so, it returns the value it had calculated the first time it was called. If it hasn't been called with the same arguments before, the function will instead call the function that was passed in to memo, and then store the result of that inside the table. This way, if the memoized function is called with the same arguments a second time, it can just look up the result in the table.
Next, for defmemo itself, we need to generate code that takes the body as a lambda expression, passes that lambda function through memo, and uses that as the implementation of the function. One way to set the implementation of a function to be a lambda function is to use setf with symbol-function. For example, here is how you could set the implementation of square to be a lambda function that squares its argument:
(setf (symbol-function 'square)
      (lambda (x) (* x x)))

(square 5) => 25
Based on the paragraph above, here is the code for defmemo:
(defmacro defmemo (name args &body body)
  `(setf (symbol-function ',name)
         (memo (lambda ,args ,@body))))
Now instead of defining a function with defun, we can define it with defmemo and it will automatically be memoized! Defmemo is a great example of how you can define your own ways to define functions. Many libraries provide similar features in which you use the same syntax as defun, only with a bit of magic thrown in.
10 Nov 2015 12:00pm GMT
06 Nov 2015
Clozure CL 1.11 is now available. Please see http://ccl.clozure.com/download.html for instructions on how to get it.
06 Nov 2015 8:34pm GMT
04 Nov 2015
We're happy to announce our second invited speaker for the next European Lisp Symposium (May 9-10 2016, Krakow, Poland). Francis Sergeraert (Institut Fournier, Grenoble, France) will be speaking about lexical closures and complexity. All the details are already on the website...
04 Nov 2015 12:00am GMT
Declt 2.0.1 "Benjamin Sisko" is out. This is a bugfix release with one internal change (a special variable was not following the earmuffs convention) and one actual bugfix (the same Texinfo anchor was generated for symbols with the same name but in different packages).
Find it at the usual place...
04 Nov 2015 12:00am GMT
01 Nov 2015
- cl-itertools - itertools Python lib ported to CL - MIT
- cl-tesseract - CFFI bindings to the Tesseract OCR library. - MIT
- fast-websocket - Optimized WebSocket protocol parser - BSD 2-Clause
- lisp-critic - LISP-CRITIC - A Lisp code critiquing package. - MIT Licence
- redirect-stream - Offers a stream that redirects all actions to an inner stream. - Artistic
- trivial-ssh - An abstraction layer over cl-libssh2. - MIT
Updated projects: 3d-vectors, archive, binfix, buffalo, calispel, ceramic, chirp, city-hash, cl-ana, cl-async, cl-autowrap, cl-charms, cl-gists, cl-glfw3, cl-gobject-introspection, cl-hamt, cl-jpl-util, cl-launch, cl-liballegro, cl-libyaml, cl-locale, cl-marklogic, cl-messagepack, cl-openstack-client, cl-opsresearch, cl-permutation, cl-quickcheck, cl-readline, cl-redis, cl-riff, cl-scripting, cl-smtp, cl-yaml, clack, clavier, cletris, clfswm, climon, clml, closer-mop, clsql-orm, codata-recommended-values, colliflower, com.google.base, commonqt, crane, croatoan, crypto-shortcuts, dexador, dissect, djula, docparser, drakma, drakma-async, eazy-gnuplot, eazy-project, esrap, external-program, fare-csv, fare-memoization, fast-http, frpc, gendl, hh-redblack, interface, introspect-environment, jonathan, json-responses, lack, lake, lift, lisp-gflags, lucerne, mathkit, mcclim, mgl-pax, montezuma, more-conditions, ningle, perlre, plump, postmodern, qt-libs, qtools, quri, rutils, s-http-client, scalpl, scriptl, serapeum, simple-tasks, snappy, staple, stumpwm, swank-crew, sxql, trivial-download, trivial-signal, uiop, unix-options, unix-opts, utm, varjo, websocket-driver, woo, xml-emitter.
Removed projects: teepeedee2. Removed because of the way it clobbers the ASDF configuration to load its own alexandria.
Incidentally, October marks the fifth anniversary of the initial release of Quicklisp. Enjoy!
01 Nov 2015 10:16pm GMT