19 Aug 2014
I counted about sixty people in the hall - really not very many for an ILC where one might typically hope for twice that. I didn't see anyone from the CL vendors, hardly anyone from any of the other CL implementations, few people I knew from previous outings, indeed really not many of the "usual suspects" at all. But note that some of this works two ways: I could equally be upbeat and say there wasn't much overlap between Montreal and the recent ELS in Paris: this conference had netted a different audience, which is fine.
My only fault with the program was that there wasn't enough of it. For an ILC there really should be enough material to fill four days. At two and a half, this gig felt like one of the European local meetings, translocated.
As for the presentations: the one that left me totally gaping in amazement was the lightning talk from a 14-year-old (sorry kid, I didn't make a note of your name) who'd implemented his own lisp dialect along with embedded image processing and all manner of other bells and whistles. Of the named speakers, I should mention: Dave Cooper's hands-on Gendl tutorial; Christian Queinnec's Small (but Massive) Open Online Course; François-René Rideau's CL scripting; Dave Pendler's amalgam of LISP 1.5 and APL\360; and Robert Strandh's SICL implementation notes. All that time we spent building bigger and better caches is now just history.
My own (co-)talk about work on an NLP project went OK (I think). Judging by the questions, the audience found Michael Young's social science side much more interesting than my comments about the lisp, and I'm not going to disagree with them.
So for two and a half days, some part of the lisp community had come face-to-face. We presented and listened to the talks, we drank the coffee and ate the food, we explored Montreal from below. And when it was all over we all went away again, back to our separate corners. I think the next "lisp community" meet will be European Lisp Symposium 2015, in London next April / May. I mention this "community" idea, because it came up in the panel on the last day, and it's an interesting thought to mull over: is there one? (or more than one?) should there be one? and if you had it in front of you what would you do with it?
19 Aug 2014 2:05pm GMT
17 Aug 2014
Listora is seeking a Clojure/ClojureScript and Datomic engineer
Reply to Henry dot ec at Gmail dot com
London-based start-up with a highly experienced (and awesome) distributed team of researchers, engineers and designers. Although the project is still in its infancy (very greenfield), we have a big vision of being the global leader in high-quality structured events data, which will help drive the contextual and personalised event discovery experiences of the future.
We're developing our platform along with applications that meet the needs (jobs/outcomes) of content data partners, event organisers, publishers and developers. It's both a huge CS and product challenge and we need exceptional and ambitious people to solve the problems that we'll come up against.
We've been developing our core platform with Clojure + Datomic/Cassandra and our applications with ClojureScript/Om. We have a very advanced stack that any passionate FP engineer would be excited about working on. We also regularly work in the open source community and are already releasing libraries.
In terms of product, we have an upfront investment in high quality research and concept evaluation taken from years of experience using both customer and outcome driven development methodologies. We are very open in what we do and have a toolset to build our products very closely with our customers. We don't work to deadlines, but we do spend a lot of time thinking about the prioritisation of what we are and could be building.
We are looking for engineers who can hit the ground running to join our team. The right person will be self-motivated, conscientious, humble and a great communicator. We are looking for people with many years' FP experience in senior, lead or architectural roles who are used to working in a start-up environment with start-up tooling.
We want to work with fantastic engineers and passionate people. This is one of the things that motivates us. We don't mind where in the world you're based or whether you want a permanent or contracting relationship.
If you're the right fit with our team we will work hard to look after you - from giving you the flexibility to explore ideas and technologies for the job at hand to remuneration.
If you have any questions, we would love to hear from you.
17 Aug 2014 11:39am GMT
DiligenceEngine, a Toronto-based startup using machine learning to automate legal work, is hiring two Clojure/Clojurescript developers. They say:
We're looking for a developer to work on our clojure/clojurescript/om web stack. Our team is small, pragmatic, and inquisitive; we love learning new technologies and balance adoption with good analysis. We prefer to hire near us, but also welcome remote work in a time zone within North America.
See the full job descriptions on the DiligenceEngine website.
17 Aug 2014 2:57am GMT
This Stack Overflow post points out an obscure and undocumented weakness in Intel's implementation of the POPCNT instruction: although the population count (number of bits equal to 1) is only a function of the source argument, hardware schedules it as though it also depended on the destination. GCC, clang and MSVC all fail to take this issue into account.
Until a new patched version of my favourite C compiler is released, there aren't many tasteful workarounds for this performance bug. I'd have to switch to inline asm, and either force the compiler to allocate the same register to the input and the result, or force different registers and clear the spurious dependency with a xor. Ideally, I wouldn't impose any additional constraint on the register allocator and only insert a xor if the destination and source registers don't match.
SBCL easily supports this use case, without having to re-release or even recompile the implementation: VOPs (virtual operations) execute arbitrary CL code during code generation and they can be defined at runtime.
The first step is to make sure that SBCL's assembler knows how to emit popcnt: the assembler can also be extended at runtime, but that's more hairy and a topic for another post. Instruction encodings are defined in src/compiler/$ARCH/insts.lisp, and a quick grep reveals (define-instruction popcnt (segment dst src) ...): the x86-64 backend learned about popcnt in May 2013 (thanks to Douglas Katzman).
We define VOPs via define-vop, a macro that exposes many options. Most of the time, it's easiest to look at a pre-existing definition for an operation that's similar to the one we want to add. Popcount looks like integer negation: it has a single (machine integer) argument and returns another integer. Integer negation is defined in the x86-64 backend's arithmetic definitions:
[listing elided: SBCL's %negate VOP definitions, about 40 lines]
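The elided listing looks roughly like the following sketch, paraphrased from SBCL's x86-64 arithmetic VOPs; exact names, costs, and details may differ between SBCL versions.

```lisp
;; Paraphrased from SBCL's src/compiler/x86-64 backend; a sketch, not verbatim.
(define-vop (fast-safe-arith-op)
  (:policy :fast-safe)   ; applicable at high speed AND high safety
  (:effects)             ; scheduler noise inherited from older ports
  (:affected))

(define-vop (fixnum-unop fast-safe-arith-op)
  (:args (x :scs (any-reg) :target res)) ; tagged argument, reuse the register
  (:results (res :scs (any-reg)))
  (:note "inline fixnum arithmetic")
  (:arg-types tagged-num)
  (:result-types tagged-num))

(define-vop (signed-unop fast-safe-arith-op)
  (:args (x :scs (signed-reg) :target res))
  (:results (res :scs (signed-reg)))
  (:note "inline (signed-byte 64) arithmetic")
  (:arg-types signed-num)
  (:result-types signed-num))

;; The fill-in-the-blanks definitions that actually emit code for %negate.
(define-vop (fast-negate/fixnum fixnum-unop)
  (:translate %negate)
  (:generator 1          ; cost 1: cheapest applicable variant wins
    (move res x)
    (inst neg res)))

(define-vop (fast-negate/signed signed-unop)
  (:translate %negate)
  (:generator 2
    (move res x)
    (inst neg res)))

(define-vop (fast-negate/unsigned signed-unop)
  (:args (x :scs (unsigned-reg) :target res))
  (:arg-types unsigned-num)
  (:results (res :scs (unsigned-reg)))
  (:result-types unsigned-num)
  (:translate %negate)
  (:generator 3
    (move res x)
    (inst neg res)))
```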
The code snippet above includes a bit of boilerplate to factor out commonalities via inheritance. The first definition introduces fast-safe-arith-op, VOPs that apply in both high speed and high safety settings (the rest is copy/pasted noise from earlier ports that sport a scheduler); the second one extends fast-safe-arith-op to define fixnum-unop, a base definition for single-argument operations on fixnums, while the third one is the same, but for machine integers. The last three definitions fill in the blanks so the compiler can compile %negate of fixnum, signed and unsigned integers. The (:translate %negate) bit means that these VOPs can be emitted instead of calls to %negate. The integer after :generator defines the "cost" of each variant; the compiler will choose the (applicable) variant with the least cost and execute the code sequence that follows to convert a call to %negate into machine code.
This kind of implementation inheritance is fine for an SBCL backend, where we define many VOPs and expect developers to understand the system; I doubt it's a didactic win. Let's do something simpler for popcnt. In the interest of simplicity, I'll also completely disregard powerful details in define-vop that are rarely relevant when defining intrinsics that map directly to machine instructions.
First, we need to tell the compiler that we're about to do special things to a function named popcnt (and to blow away any pre-existing information if the defknown form is re-evaluated).
[listing elided: the defknown form for popcnt]
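A plausible reconstruction of the elided form follows; the POPCNT package name comes from the post, while :overwrite-fndb-silently is my assumption about how the re-evaluation issue was handled.

```lisp
(defpackage "POPCNT"
  (:use "CL")
  (:export "POPCNT"))

(in-package "SB-C")

;; popcnt maps a 64-bit unsigned integer to an integer in [0, 64].
;; movable/foldable/flushable: the function is pure, may be
;; constant-folded, and may be eliminated as dead code.
(defknown popcnt:popcnt ((unsigned-byte 64)) (integer 0 64)
    (movable foldable flushable)
  :overwrite-fndb-silently t)
```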
This says that popcnt accepts a 64-bit unsigned integer and returns an integer between 0 and 64 (inclusively), and that the function can be constant-folded, flushed (eliminated as dead code) and moved around (it's pure).
Now, to define a VOP that implements popcnt:
[listing elided: the VOP definition for popcnt]
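The following sketch reconstructs that listing from the description below; the VOP name and register-allocation hints are assumptions, not the post's verbatim code.

```lisp
(in-package "SB-VM")

(define-vop (popcnt)
  (:policy :fast-safe)             ; applicable at any optimization policy
  (:translate popcnt:popcnt)       ; emitted instead of calls to popcnt:popcnt
  (:args (x :scs (unsigned-reg) :target r)) ; machine integer in a register,
                                            ; ideally the same register as r
  (:arg-types unsigned-num)
  (:results (r :scs (unsigned-reg)))
  (:result-types unsigned-num)
  (:generator 3                    ; cost 3, a common arithmetic default
    (unless (location= r x)
      ;; POPCNT spuriously depends on its destination: clear r first.
      (inst xor r r))
    (inst popcnt r x)))
```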
We define a new VOP named popcnt:popcnt (the name is arbitrary, as long as it doesn't collide with another VOP) that is applicable at all optimization policies (both high speed and high debug level), and that implements popcnt:popcnt. Its first and only argument, x, is an unsigned-num, an unsigned machine integer, that can only be stored in a register. Moreover, if possible, we'd like x to be allocated the same register as the result, r. There's only one result (r) and it's an unsigned machine integer in a register, just like x. The generator, of cost 3 (a common default for arithmetic operations), breaks any dependency chain in r if necessary, and stores the population count of x in r.
At first sight, the defknown form seems to conflict with the VOP. We declare that the return value of popcnt is a small integer, clearly a fixnum, and then define a VOP that returns a machine integer. The subtlety is that defknown is concerned with IR1, the higher-level intermediate representation, which works on CL types (i.e., types as sets) and abstract values. VOPs, on the other hand, are defined for the lower-level IR2, where types describe concrete representations (like C). It is perfectly meaningful to say that a small integer will be represented as an untagged machine integer.
The next step isn't strictly necessary, but helps people who like their REPL. The compiler knows how to compile calls to popcnt, so we can define popcnt… as a call to popcnt. Our new function is now a first-class value that can be called from interpreted code and passed to higher-order functions, like the compiler's constant-folding pass.
[listing elided: the popcnt wrapper function]
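The wrapper is presumably the classic self-call trick, sketched below under the assumption that the defknown and VOP above are in effect.

```lisp
(in-package "POPCNT")

;; Thanks to the defknown/:translate machinery, the compiler replaces
;; this self-call with the popcnt VOP rather than a recursive call, so
;; the defun gives us a first-class, REPL-callable function.
(defun popcnt (x)
  (popcnt x))
```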
CL-USER> (disassemble 'popcnt:popcnt)
; disassembly for POPCNT:POPCNT
; Size: 25 bytes
; 07FCDB6E: 4831D2     XOR RDX, RDX    ; no-arg-parsing entry point
;       71: F3480FB8D1 POPCNT RDX, RCX
;       76: 48D1E2     SHL RDX, 1
;       79: 488BE5     MOV RSP, RBP
;       7C: F8         CLC
;       7D: 5D         POP RBP
;       7E: C3         RET
[ error trap noise ]
CL-USER> (popcnt:popcnt 42)
3
The disassembly shows that we get the code that we expect, including the dependency-breaking workaround, and the smoke test passes. There's one interesting detail: we only defined a VOP that returns a machine integer. However, popcnt returns a tagged value (a fixnum), and does so with an efficient shift. IR2 takes care of inserting any coercion needed between VOPs (e.g., between popcnt and the VOP used to return boxed values from functions), and the IR1 defknown guarantees that the result of popcnt, despite being represented in an unsigned machine integer, is small enough for a fixnum.
Let's see what happens when we feed arithmetic into popcnt:
CL-USER> (disassemble (lambda (x y)
                        (declare (type (unsigned-byte 32) x y))
                        (popcnt:popcnt (+ x y))))
; disassembly for (LAMBDA (X Y))
; Size: 55 bytes
; 0752BD59: 4801FA     ADD RDX, RDI    ; no-arg-parsing entry point
;       5C: 48D1FA     SAR RDX, 1
;       5F: F3480FB8D2 POPCNT RDX, RDX
;       64: 48D1E2     SHL RDX, 1
;       67: 488BE5     MOV RSP, RBP
;       6A: F8         CLC
;       6B: 5D         POP RBP
;       6C: C3         RET
After adding two fixnums, an automatic coercion unboxes the resulting fixnum into a machine integer which is then passed to popcnt (note the lack of a dependency-breaking xor now that the source and destination are the same register).
That's pretty good code, but we can do better: fixnums are tagged with 0, so we can simply feed fixnums to popcnt without untagging.
This is where the cost parameter to :generator comes in: we can define another VOP for popcnt of fixnums and bias the compiler to prefer the fixnum VOP.
[listing elided: the fixnum VOP for popcnt]
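A sketch of what the fixnum variant plausibly looks like; the name and the exact cost are assumptions, chosen only so this VOP wins over the unsigned-num one.

```lisp
(in-package "SB-VM")

(define-vop (popcnt/fixnum)
  (:policy :fast-safe)
  (:translate popcnt:popcnt)
  (:args (x :scs (any-reg) :target r))  ; tagged fixnum; its tag bits are 0,
                                        ; so they don't affect the popcount
  (:arg-types positive-fixnum)
  (:results (r :scs (unsigned-reg)))
  (:result-types unsigned-num)
  (:generator 2                         ; cheaper than the unsigned-num VOP,
                                        ; so the compiler prefers it
    (unless (location= r x)
      (inst xor r r))                   ; dependency-breaking xor
    (inst popcnt r x)))
```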
CL-USER> (disassemble (lambda (x y)
                        (declare (type (unsigned-byte 32) x y))
                        (popcnt:popcnt (+ x y))))
; disassembly for (LAMBDA (X Y))
; Size: 47 bytes
; 07BEABE9: 4801FA     ADD RDX, RDI    ; no-arg-parsing entry point
;      BEC: F3480FB8D2 POPCNT RDX, RDX
;      BF1: 48D1E2     SHL RDX, 1
;      BF4: 488BE5     MOV RSP, RBP
;      BF7: F8         CLC
;      BF8: 5D         POP RBP
;      BF9: C3         RET
Unlike many low-level languages, CL includes a standard function for population count, logcount. SBCL includes a VOP for logcount (with a cost of 14), which we can supersede with our own popcnt-based VOPs: we only have to replace (:translate popcnt:popcnt) with (:translate logcount). That's an easy improvement, but it isn't in trunk because popcnt is a recent x86 extension.
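As a portable sanity check on the semantics (plain Common Lisp, no SBCL internals; this exercises the specification of population count, not the VOP itself), logcount of a non-negative integer is exactly the bit count popcnt computes:

```lisp
;; LOGCOUNT is standard CL; for non-negative arguments it counts 1 bits,
;; which we cross-check here against a naive LOGBITP loop.
(dolist (x (list 0 1 42 #xff (1- (expt 2 64))))
  (assert (= (logcount x)
             (loop for i below (integer-length x)
                   count (logbitp i x)))))
```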
Adding VOPs for (ad hoc) polymorphic or otherwise generic functions can be surprising: a VOP will only be considered if the arguments and the return values are known to have CL types that are compatible with the VOP's representation specification. For popcnt, we guarantee that the return value is a positive integer between 0 and 64; for logcount, the defknown guarantees that the return value is a positive fixnum. In both cases, the return values can always be represented as an unsigned machine integer, so our new VOPs will always match if the argument fits in positive fixnums or unsigned machine integers (and they will have priority over the generic x86-64 VOP because their cost is lower). More complex cases depend on derive-type optimizers, but that's rarely necessary when defining intrinsics for low-level code.
17 Aug 2014 2:55am GMT
14 Aug 2014
Quantum Computing is a fascinating field, but currently a contentious one. The only examples we have of real-world, hardware quantum computers are the line of adiabatic quantum computers from D-Wave Systems, and many voices in the scientific community still protest their identification as such simply because the machines are not full-fledged gate-model quantum computers complete with persistent quantum data storage and QRAM. However, by the strictest definition, any machine which exploits quantum mechanical phenomena for the purpose of computation is a quantum computer, and the D-Wave One and Two meet this definition.
For us in the Lisp community, Quantum Computing is even more important: one of the most surprising secrets of the D-Wave line of adiabatic quantum computers is that their low-level operating system is programmed in Common Lisp. Specifically, D-Wave uses SBCL. This choice is not accidental or arbitrary; Common Lisp is uniquely suited to the task of quantum computer programming.
Lambda Calculus, Functional Programming, and Quantum Computing
Peter Selinger, a Professor of Mathematics at Dalhousie University, has been publishing papers regarding Quantum Computer programming languages and Lambda Calculus for a number of years, extending the work of van Tonder and others. In particular, he formalized his argument in favor of Lambda Calculus as a natural expression of quantum algorithms for classically-controlled quantum computers in the following papers:
- A lambda calculus for quantum computation with classical control, 2006
- Quantum lambda calculus, 2009
Selinger made one mistake, in my opinion: he chose Haskell to implement his quantum computer simulator and quantum computer programming language, Quipper. While he could have chosen Scheme without any complaint, he should have chosen Common Lisp, just like D-Wave. To prove this point, I took my work on a Quantum Common Lisp I originally designed for my science-fiction novel and started the BLACK-STONE open-source project to create a faster quantum computer simulator and programming environment in fewer lines of code. While the BLACK-STONE project is still in its infancy (and I haven't had the time to work on it at all in recent months), the simplicity, elegance, clarity, and succinctness of the Common Lisp source compared to the Haskell is already heavily apparent.
D-Wave One and Two are rudimentary Quantum Lisp Machines
I was surprised and pleased to discover in my personal conversations with the lead physicists, engineers, and developers at D-Wave, that they had come to the same conclusion as me, but had the means to act on those insights. They had already accomplished what I set out to do with Quantum Common Lisp and BLACK-STONE; unfortunately, their source code is very much proprietary, and even as a registered D-Wave developer, I had no access to their Lisp operating system for the D-Wave One and Two.
All I can do, as an outsider looking in, is make certain informed assumptions. I could easily be completely wrong, but it seems to me fairly self-evident that the D-Wave platforms qualify as rudimentary prototype Quantum Lisp Machines. The D-Wave hardware was clearly designed with Lisp in mind, and the low-level Operating System of the D-Wave hardware is written entirely in Common Lisp. Personally, I don't think it would have even been possible for them to build a quantum computer at all without Lisp.
As I understand the integration model, the D-Wave platforms rely on classical supercomputers for their interface, memory, and persistent storage, so the quantum algorithms that can be run are limited to those that accept measured, classical data from the interface and return measured, classical data to it; algorithms which expect raw, unmeasured quantum data as input, or which keep such data unmeasured in memory, are not currently possible.
Still, any quantum computer at all is an amazing feat, and certainly D-Wave's customer base seems quite happy with the platform. The machines are purpose-built for a specific class of optimization problems which classical computers have a great deal of difficulty solving.
Regarding their former developer program, which has now been closed for some time, it is interesting to ponder why, when the core platform is 100% Lisp, they would only release a Python Pack as their developer API and not a Lisp interface. I suspect it came down to a matter of perceived value and a hope for rapid adoption: they put a lot of time and effort into writing a quantum computer simulator for Python, and a number of excellent tutorials to go with it, but still they did not get the adoption they hoped for. As a result, they closed the developer program and scrapped the Quantum Cloud platform that I was relying on, and waiting for, to finish some of my more interesting (to me, at least) software projects.
The problem with the choice of Python, however, is that Quantum Computer programming only really makes sense in Lisp, and as a result only Lisp Hackers can understand it. Other programming languages only serve to conflate the simple elegance of quantum energy programming into a terrifying, obtuse, impenetrable subject, while in Lisp it is expressed so naturally and intuitively that it's almost as if Lisp itself was purpose-designed for quantum computing.
You can see this for yourself in my project SILVER-SWORD, a Common Lisp bridge to the D-Wave Python Pack. Unfortunately, as I've said before, you can't actually use the library since D-Wave no longer distributes their Python Pack, but you can view the translated tutorial examples in the repo and see for yourself: quantum energy programming is really easy in Lisp. And SILVER-SWORD would be even better if the underlying Lisp-Python bridge, BURGLED-BATTERIES, supported Python class to CLOS translation.
Cons-cells, qubits, chimera-graphs, and the human brain
The reason I say that only Lispers can understand Quantum Computer programming is based on the observation of the fundamental structural similarities between a generalized model Lisp program, the D-Wave hardware, and the human brain. From an experiential perspective, learning and using one of the many Lisp-family languages offers the individual programmer specific insight into the functioning of their own internal platform.
Consider a general, recursive function in Common Lisp. Its internal representation within the Lisp run-time is an acyclic graph of cons-cells, pairs of pointers that map to memory or to other cons-cells. Every Lisper eventually learns to think in terms of cons-cells, and to worry about the efficiency of their code in terms of tree-walking: which path through all possible cons-cells to the desired return value is the shortest? And which path reduces the number of conses to the absolute minimum?
Now consider a learned skill, such as the art of programming; it can be a lengthy process for some, a seemingly effortless task for others. Many people brush this off as a simple matter of intelligence, but in fact it is due to another learned skill: the art of learning itself, whereby the individual learns to integrate and categorize new information more efficiently than others, thus appearing smarter. Technically speaking, as we have been learning from the field of neuroscience, all human beings are born with roughly equal potential; they are shaped and molded into individuals by their experiences and environment. In short, since so few of us know any better, we allow ourselves and our children to be automatically programmed by effectively random, chaotic, and unpredictable perceptions, instead of learning and teaching a methodology of self-mastery whereby the individual programs themselves. At the moment, intelligence and an aptitude for the sciences, engineering, mathematics and programming is more or less an accident, much like all talents; but in the near future, all talents could easily be engineered by the individual into themselves.
Then consider the concept of Quantum Computer energy programming, introduced by D-Wave as the programming paradigm to support their novel adiabatic quantum computer hardware. Because the underlying flux-qubits of the D-Wave processors exist in superposition until they are measured, they effectively process every probable result of a program simultaneously. The most efficient answer, the lowest energy solution, is returned first; even though all other results returned are also correct, from a traditional standpoint, the lowest energy solution is the most correct, even when other results have a higher probability. This point corresponds to both the question of intelligence, and the most efficient recursive function in Lisp, above. The most obvious path is not always the most correct, but a strong case can always be made to favor efficiency over all other factors.
These three points correspond to a psychological/computational Occam's Razor-when faced with multiple solutions of high probability, the most efficient solution is the most correct.
Alongside a cursory study of neuroscience and theoretical physics, you can begin to see the similarities between the structure of Lisp programs, the implementation of the D-Wave quantum annealing processor, and the structure of the human brain itself.
As I have said before, Lisp is the language of the Universe, the Voice and Will of the True Self. This is simply a poetic expression of super-symmetry, and of how it relates to the concept of Grokking in the Lisp community. The point is, the acyclic graph structure of cons-cells allows for a natural description of the fundamental (and inherently quantum) phenomena of the physical universe, as well as of the human brain itself.
Thus, when programmed in Lisp, a sufficiently complex neural-net application running natively on D-Wave-style hardware of equal complexity and efficiency to the human brain is capable of emergent machine intelligence. The less it is programmed to behave intelligently, the better; it only requires a diverse selection of sensory input to perceive its external environment, a means to differentiate the internal from the external, and a core low-level operating system to manage its internal state. Experiments at D-Wave have already confirmed this.
One might even be so bold as to say that the super-symmetry is so perfectly expressed, that we human beings are naturally-occurring, organic quantum lisp machines. And this point holds whether you are an evolutionist, creationist, or otherwise.
The Future of Computing and Lisp
If quantum computing is the future of computer science as a whole, then by extension, so is Lisp. As Lisp Hackers, we have a responsibility to push computing to its limits, remind the world that Common Lisp is still the most advanced and powerful programming language, and ensure that the next generation of consumer-grade computers are all, every single one of them, Quantum Lisp Machines.
The inherent super-symmetry of Lisp programs, the central nervous system, and the physical universe is perhaps the strongest argument in favor of this. After all, human evolution has been marked out since the beginning of history not by advantageous genetic mutation but by technological innovation, and what tool is more powerful than a language which can naturally express the fundamental laws of physics, the underlying structure of the human brain, and the ultimate model of computation?
Together, Quantum Computing and Lisp can help us better understand ourselves, our true nature, the universe we live in, and the limitless potential of our species; that the combination will almost certainly lead to emergent machine intelligence and the technological singularity is pure gravy, after that. Some days it is truly staggering to live at such a time in human history, knowing that we Lisp Hackers hold the future of the entire human race in our hands, at our keyboards.
The next time you fire up Emacs and type M-x slime, remember this. Hack like the whole world is counting on you and every line of code you write; hack like you're channeling the Will of the universe itself. Because the future is here, now: the future is Quantum Common Lisp.
14 Aug 2014 6:17pm GMT
Thanks go out to Orivej Desh, who pointed out what I overlooked; the LET-OVER-LAMBDA package now works with full functionality restored in SBCL 1.2.2. I have preserved the :safe-sbcl feature used for testing the version of SBCL, so that descending into sb-impl::comma is only enabled for SBCL 1.2.2+.
The updated code should be available in the August release of Quicklisp. If you have a fork or clone, please be certain to pull the latest changes from the master branch.
14 Aug 2014 8:48am GMT
12 Aug 2014
As expected, the Quicklisp distribution of LET-OVER-LAMBDA is broken by the changes to the backquote reader macro in SBCL 1.2.2; I expect this change breaks a good portion of Paul Graham's macro code examples from On Lisp as well.
A quick fix suggested on Reddit is to use a "pseudo-flatten" for SBCL that also descends into sb-impl::comma. I will be testing this fix today, and hopefully pushing an updated version for the August release of Quicklisp.
Update: I tested modifying LOL:FLATTEN to descend into sb-impl::comma objects, but no joy. I have currently disabled DEFMACRO!-based code in LET-OVER-LAMBDA until I can find a better solution, and tested this against both v1.2.2 and v1.2.0-1 of SBCL (so if nothing else, it will at least build without errors).
If anyone knows of a better solution, feel free to leave a comment here or on the GitHub Issue thread.
12 Aug 2014 4:25pm GMT
08 Aug 2014
I'm computing the set cover by incrementally adding items to be covered. Naturally, the order in which you add items changes the way the program progresses. I added code that picks an item to add each iteration rather than just pulling the car off the front of a list.
(define (cover8 value->keys-table better-solution)
  (define (add-v-k-entry solution-set v-k-entry)
    (let ((value (car v-k-entry))
          (keys (cdr v-k-entry)))
      (write-string "Adding value ")
      (write value)
      (newline)
      (write-string "  with keys ")
      (write keys)
      (newline)
      (write-string "  to ")
      (write (length solution-set))
      (write-string " partial solutions.")
      (newline)
      (let ((new-solutions (map make-new-solution
                                (cartesian-product solution-set keys))))
        (let ((trimmed-solutions
               (trim-partial-solutions value->keys-table new-solutions)))
          (write-string "Returning ")
          (write (length trimmed-solutions))
          (write-string " of ")
          (write (length new-solutions))
          (write-string " new partial solutions.")
          (newline)
          trimmed-solutions))))

  (define (cover v-k-entries)
    (cond ((pair? v-k-entries)
           (pick-v-k-entry value->keys-table v-k-entries
                           (lambda (selected remaining)
                             (add-v-k-entry (cover remaining) selected))))
          ((null? v-k-entries) (list '()))
          (else (improper-list-error 'cover v-k-entries))))

  (let ((minimized (minimize-vktable value->keys-table better-solution)))
    (least-elements (cover minimized) better-solution)))

(define (pick-v-k-entry value->keys-table v-k-entries receiver)
  (define (score v-k-entry)
    (let* ((matched-all
            (count-matching-items value->keys-table
              (lambda (other)
                (there-exists? (cdr v-k-entry)
                  (lambda (key) (member key (cdr other)))))))
           (matched-remaining
            (count-matching-items v-k-entries
              (lambda (other)
                (there-exists? (cdr v-k-entry)
                  (lambda (key) (member key (cdr other)))))))
           (matched-forward (- matched-all matched-remaining)))
      (cons matched-remaining matched-forward)))
  (let ((scored (map (lambda (v-k-entry)
                       (cons (score v-k-entry) v-k-entry))
                     v-k-entries)))
    (let ((picked
           (cdar (least-elements
                  scored
                  (lambda (left right)
                    (let* ((len-l (length (cdr left)))
                           (len-r (length (cdr right)))
                           (lmr (caar left))
                           (lmf (cdar left))
                           (rmr (caar right))
                           (rmf (cdar right)))
                      (or (> len-l len-r)
                          (and (= len-l len-r)
                               (or (> lmf rmf)
                                   (and (= lmf rmf)
                                        (< lmr rmr)))))))))))
      (display "Picking ")
      (write picked)
      (newline)
      (receiver picked (delete picked v-k-entries)))))

(define (trim-partial-solutions value->keys-table partial-solutions)
  (let ((equivalent-solutions
         (map (lambda (entry) (cons (cdr entry) (car entry)))
              (collect-equivalent-partial-solutions value->keys-table
                                                    partial-solutions))))
    (write-string "  Deleting ")
    (write (- (length partial-solutions) (length equivalent-solutions)))
    (write-string " equivalent partial solutions.")
    (newline)
    (remove-dominated-solutions
     value->keys-table
     (map lowest-scoring-equivalent-partial-solution equivalent-solutions))))
Finally, it turns out that computing dominating partial solutions is expensive, so I changed the set operations to use a bitmap representation:
(define (remove-dominated-solutions value->keys-table partial-solutions)
  (let ((before-length (length partial-solutions))
        (all-values (get-values value->keys-table)))
    (let ((table
           ;; put the long ones in first
           (sort (map (lambda (partial-solution)
                        (cons partial-solution
                              (lset->bset all-values
                                          (map car (partial-solution-matches
                                                    value->keys-table
                                                    partial-solution)))))
                      partial-solutions)
                 (lambda (left right)
                   (> (length (bset->lset all-values (cdr left)))
                      (length (bset->lset all-values (cdr right))))))))
      (let ((answer (map car
                         (fold-left (lambda (answer solution)
                                      (if (there-exists? answer
                                            (dominates-solution? solution))
                                          answer
                                          (cons solution answer)))
                                    '()
                                    table))))
        (let ((after-length (length answer)))
          (write-string " Removing ")
          (write (- before-length after-length))
          (write-string " dominated solutions.")
          (newline)
          answer)))))

(define (dominates-solution? solution)
  (let* ((partial-solution (car solution))
         (partial-solution-score (score partial-solution))
         (solution-matches-raw (cdr solution)))
    (lambda (other-solution)
      (let* ((other-partial-solution (car other-solution))
             (other-matches-raw (cdr other-solution)))
        (and (bset-superset? other-matches-raw solution-matches-raw)
             (<= (score other-partial-solution) partial-solution-score))))))

(define (get-values v-k-table)
  (fold-left (lambda (answer entry)
               (lset-adjoin equal? answer (car entry)))
             '()
             v-k-table))

(define (bset-element->bit universe element)
  (cond ((null? element) 0)
        (else (expt 2 (list-index (lambda (item) (eq? item element)) universe)))))

(define (bset-adjoin universe bset element)
  (bset-union bset (bset-element->bit universe element)))

(define (lset->bset universe lset)
  (fold-left (lambda (answer element)
               (bset-adjoin universe answer element))
             0
             lset))

(define (bset->lset universe bset)
  (cond ((zero? bset) '())
        ((even? bset) (bset->lset (cdr universe) (/ bset 2)))
        (else (cons (car universe)
                    (bset->lset (cdr universe) (/ (- bset 1) 2))))))

(define (bset-union left right) (bitwise-ior left right))

(define (bset-superset? bigger smaller)
  ;; Is every element of smaller in bigger?
  (zero? (bitwise-andc2 smaller bigger)))
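The bitmap representation assigns each value in the universe one bit of an integer, so set union becomes bitwise OR and the superset test becomes a single mask. Here is a minimal Python sketch of the same idea (the helper names are mine, not from the code above):

```python
def lset_to_bset(universe, lset):
    """Encode a list of values as an integer bitset, one bit per
    position in the universe (mirrors lset->bset above)."""
    return sum(1 << universe.index(element) for element in set(lset))

def bset_superset(bigger, smaller):
    """Is every element of smaller in bigger?  True when smaller has
    no bits outside bigger (cf. bitwise-andc2 in the Scheme code)."""
    return smaller & ~bigger == 0

universe = ["washington", "adams", "madison", "monroe"]
a = lset_to_bset(universe, ["washington", "madison"])  # bits 0 and 2 -> 5
b = lset_to_bset(universe, ["madison"])                # bit 2 -> 4
print(bset_superset(a, b), bset_superset(b, a))        # True False
```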
This code can now find the shortest regular expression consisting of letters and dots (and ^$) that matches one set of strings but not another.
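The separation property itself is easy to sanity-check with any regex engine. A small Python sketch, using made-up winner and loser lists rather than the presidents data:

```python
import re

winners = ["lincoln", "wilson"]
losers = ["clinton", "nelson"]

def separates(pattern, winners, losers):
    """Does pattern match every winner and no loser?"""
    return (all(re.search(pattern, w) for w in winners)
            and not any(re.search(pattern, l) for l in losers))

print(separates("^.i", winners, losers))   # True: second letter is i
print(separates("i..o", winners, losers))  # False: also matches "into" in clinton
```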
Depending on the strings, this can take quite a bit of time to run. Dotted expressions cause a combinatorial explosion in the number of matching regexps (or substrings), but what makes it worse is that the dotted expressions tend to span different sets of strings. If two different dotted expressions, each with a different set of matching strings, appear in a single string, then the number of partial solutions is multiplied by two as we try each of them.
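The blow-up is easy to quantify. A hypothetical sketch: if each of three winning strings can be covered by either of two interchangeable dotted expressions, every combination of choices is a distinct partial solution, so the count doubles per string:

```python
from itertools import product

# Hypothetical choices: two interchangeable dotted expressions per string.
choices_per_string = [("c.ln", "co.n"), ("a.a", "ad"), ("i..o", "li")]

# Each way of picking one expression per string is a distinct partial solution.
partial_solutions = [frozenset(choice) for choice in product(*choices_per_string)]
print(len(partial_solutions))  # 2 * 2 * 2 = 8
```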
It is characteristic of NP problems that it is easy to determine whether you have a good solution, but quite hard to find it among the huge number of poor ones. This problem exhibits that characteristic, but there is a bit more structure in the problem that we are exploiting: the word lists are drawn from the English language, which makes some bigrams, trigrams, etc. far, far more likely to appear than others.
Short words are much easier to process than longer ones because they simply contain fewer things to match. On the other hand, longer words tend to be dominated by shorter ones anyway.
To be continued...
08 Aug 2014 10:56pm GMT
07 Aug 2014
It wasn't too hard to add regular expressions to the substring version. What took a while was just tinkering with the code, breaking it, fixing it again, noticing an optimization, tinkering, etc. etc. In any case it works and here is some of it.
(define (make-extended-ngram-table winners losers)
  (let* ((initial-ngrams (generate-ngrams winners losers)))
    (write-string "Initial ngrams: ")
    (write (length initial-ngrams))
    (newline)
    (map (lambda (winner)
           (cons winner
                 (keep-matching-items initial-ngrams
                   (lambda (ngram)
                     (re-string-search-forward ngram winner)))))
         winners)))

(define (generate-ngrams winners losers)
  (write-string "Generating ngrams...")
  (newline)
  (let ((losing-ngram? (string-list-matcher losers)))
    (fold-left (lambda (answer winner)
                 (lset-union equal? answer (extended-ngrams losing-ngram? winner)))
               '()
               winners)))

(define (string-list-matcher string-list)
  (lambda (test-ngram)
    (there-exists? string-list
      (lambda (string)
        (re-string-search-forward test-ngram string)))))

(define *dotification-limit* 4)
(define *generate-ends-of-words* #t)
(define *generate-dotted* #t)

(define (ngrams-of-length n string)
  (do ((start 0 (1+ start))
       (end n (1+ end))
       (answer '() (lset-adjoin string=? answer (substring string start end))))
      ((> end (string-length string)) answer)))

(define (generate-dotted answer losing-ngram?)
  (do ((tail answer (cdr tail))
       (answer '() (let ((item (car tail)))
                     (fold-left (lambda (answer dotted)
                                  (if (losing-ngram? dotted)
                                      answer
                                      (lset-adjoin string=? answer dotted)))
                                answer
                                (dotify item)))))
      ((not (pair? tail))
       (if (null? tail)
           answer
           (improper-list-error 'generate-dotted tail)))))

(define (dotify word)
  (cond ((string=? word "") (list ""))
        ((> (string-length word) *dotification-limit*) (list word))
        (else
         (fold-left (lambda (answer dotified)
                      (fold-left (lambda (answer replacement)
                                   (lset-adjoin equal? answer
                                                (string-append replacement dotified)))
                                 answer
                                 (replacements (substring word 0 1))))
                    '()
                    (dotify (substring word 1 (string-length word)))))))

(define (replacements string)
  (if (or (string=? string "^")
          (string=? string "$"))
      (list string)
      (list string ".")))

(define (extended-ngrams losing-ngram? string)
  (let ((string (if *generate-ends-of-words*
                    (string-append "^" string "$")
                    string)))
    (do ((n 1 (+ n 1))
         (answer '() (lset-union string=?
                                 answer
                                 (delete-matching-items (ngrams-of-length n string)
                                                        losing-ngram?))))
        ((> n (string-length string))
         (if *generate-dotted*
             (generate-dotted answer losing-ngram?)
             answer)))))
Adding the dotification greatly increases the number of ways to match words:
1 ]=> (extended-ngrams (string-list-matcher losers) "lincoln")

;Value 15: ("li" "ln" "ln$" "oln" ".ln" "col" "lin" "li." "^li" "o.n$" "oln$"
            ".ln$" "col." "c.ln" "..ln" "coln" ".oln" "co.n" "n.ol" "..ol"
            "ncol" ".col" "nc.l" "i.co" "inco" "i..o" "in.o" "lin." "li.."
            "l.nc" "linc" "l..c" "li.c" "^li." "^lin" "coln$" "ncoln" "incol"
            "linco" "^linc" "ncoln$" "incoln" "lincol" "^linco" "incoln$"
            "lincoln" "^lincol" "lincoln$" "^lincoln" "^lincoln$")
The table that maps words to their extended ngrams is quite large, but it can be reduced in size without affecting the solution to the set cover problem. If two regexps match exactly the same set of winning strings, then either can be substituted for the other in any solution, so we can discard all but the shortest of them. If one regexp matches a proper superset of the strings that another matches, and is no longer than the other, then the first regexp dominates the second, so we can discard the second.
(define (minimize-keys value->keys-table better-solution)
  (let* ((all-keys (get-keys value->keys-table))
         (equivalents (collect-equivalent-partial-solutions
                       value->keys-table
                       (map list all-keys)))
         (reduced (map (lambda (equivalent)
                         (cons (car equivalent)
                               (car (least-elements (cdr equivalent)
                                                    better-solution))))
                       equivalents))
         (dominants (collect-dominant-partial-solutions reduced better-solution))
         (good-keys (fold-left (lambda (answer candidate)
                                 (lset-adjoin equal? answer (cadr candidate)))
                               '()
                               dominants)))
    (define (rebuild-entry entry)
      (cons (car entry)
            (keep-matching-items (cdr entry)
              (lambda (item) (member item good-keys)))))
    (write-string "Deleting ")
    (write (- (length all-keys) (length good-keys)))
    (write-string " of ")
    (write (length all-keys))
    (write-string " keys. ")
    (write (length good-keys))
    (write-string " keys remain.")
    (newline)
    (map rebuild-entry value->keys-table)))

(define (partial-solution-matches value->keys-table partial-solution)
  (keep-matching-items value->keys-table
    (lambda (entry)
      (there-exists? partial-solution
        (lambda (key) (member key (cdr entry)))))))

(define (collect-equivalent-partial-solutions value->keys-table partial-solutions)
  (let ((answer-table (make-equal-hash-table)))
    (for-each (lambda (partial-solution)
                (hash-table/modify! answer-table
                                    (map car (partial-solution-matches
                                              value->keys-table
                                              partial-solution))
                                    (list)
                                    (lambda (other)
                                      (lset-adjoin equal? other partial-solution))))
              partial-solutions)
    (hash-table->alist answer-table)))

(define (collect-dominant-partial-solutions equivalents better-solution)
  (define (dominates? left right)
    (and (superset? (car left) (car right))
         (not (better-solution (cdr right) (cdr left)))))
  (let ((sorted (sort equivalents
                      (lambda (l r) (> (length (car l)) (length (car r)))))))
    (fold-left (lambda (answer candidate)
                 (if (there-exists? answer (lambda (a) (dominates? a candidate)))
                     answer
                     (lset-adjoin equal? answer candidate)))
               '()
               sorted)))
We can minimize the value->keys-table in another way. If two values in the table are matched by exactly the same set of keys, then we can delete either one without changing the solution. If a value is matched by a small set of keys, and another value is matched by a superset of those keys, then we can delete the second value, because whenever the first one is matched, the second must be matched as well.
(define (minimize-values v-k-table)
  (let ((size-before (length v-k-table)))
    (define (dominated-value? entry)
      (let ((entry-value (car entry))
            (entry-keylist (cdr entry)))
        (there-exists? v-k-table
          (lambda (other-entry)
            (and (not (eq? entry other-entry))
                 (let ((other-value (car other-entry))
                       (other-keylist (cdr other-entry)))
                   (let ((result (and (superset? entry-keylist other-keylist)
                                      (not (superset? other-keylist entry-keylist)))))
                     (if result
                         (begin (display "Removing ")
                                (write entry-value)
                                (display " dominated by ")
                                (write other-value)
                                (display ".")
                                (newline)))
                     result)))))))
    (define (equivalent-value-in-answer? answer entry)
      (let ((entry-value (car entry))
            (entry-keylist (cdr entry)))
        (there-exists? answer
          (lambda (other-entry)
            (let ((other-value (car other-entry))
                  (other-keylist (cdr other-entry)))
              (let ((result (equal? entry-keylist other-keylist)))
                (if result
                    (begin (display "Removing ")
                           (write entry-value)
                           (display " equivalent to ")
                           (write other-value)
                           (display ".")
                           (newline)))
                result))))))
    (define (add-entry answer entry)
      (if (or (equivalent-value-in-answer? answer entry)
              (dominated-value? entry))
          answer
          (cons entry answer)))
    (let ((answer (fold-left add-entry '() v-k-table)))
      (write-string "Removed ")
      (write (- size-before (length answer)))
      (write-string " dominated and equivalent values.")
      (newline)
      answer)))
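The same two rules can be sketched in Python (hypothetical names, a dict of frozensets instead of an alist): keep one value per distinct key set, and drop any value whose key set strictly contains another value's keys.

```python
def minimize_values(table):
    """table maps value -> frozenset of keys that match it."""
    # Rule 1: values with identical key sets are interchangeable; keep one.
    kept = {}
    for value, keys in table.items():
        kept.setdefault(keys, value)
    # Rule 2: if a value's keys strictly contain another value's keys, the
    # bigger one is redundant (whenever the smaller matches, so must it).
    keysets = list(kept)
    return {value: keys for keys, value in kept.items()
            if not any(other < keys for other in keysets)}

table = {"x": frozenset({"ab", "a."}),
         "y": frozenset({"ab", "a."}),        # equivalent to "x"
         "z": frozenset({"ab", "a.", "b$"})}  # dominated by "x"
print(minimize_values(table))  # only "x" survives
```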
Each time we remove values or keys, we might make more keys and values equivalent or dominated, so we iterate until we can no longer remove anything.
(define (minimize-vktable value->keys-table better-solution)
  (let* ((before-size (fold-left + 0 (map length value->keys-table)))
         (new-table (minimize-values
                     (minimize-keys value->keys-table better-solution)))
         (after-size (fold-left + 0 (map length new-table))))
    (if (= before-size after-size)
        value->keys-table
        (minimize-vktable new-table better-solution))))
The minimized table for the presidents looks like this:
(("washington" "sh" "g..n" "n..o" ".h.n" "a..i")
 ("adams" "a.a" "am" "ad")
 ("madison" "m..i" "i..n" "is." "i.o" "di" "ma" "ad")
 ("monroe" "r.e$" "oe")
 ("van-buren" "u..n" "r.n" ".b" "bu" "-")
 ("harrison" "r..s" "r.i" "i..n" "is." "i.o" "a..i")
 ("polk" "po")
 ("taylor" "ay." "ta")
 ("pierce" "ie." "rc" "r.e$")
 ("buchanan" "bu" "a.a" ".h.n")
 ("lincoln" "i..o" "li")
 ("grant" "an.$" "a.t" "ra" "r.n" "g..n")
 ("hayes" "h..e" "ye" "ay.")
 ("garfield" "el.$" "i.l" "ga" "ie." "r.i" ".f" "a..i")
 ("cleveland" "v.l" "an.$")
 ("mckinley" "n.e" "nl" "i.l" "m..i")
 ("roosevelt" ".se" "oo" "v.l" "el.$" "r..s")
 ("taft" "a.t" "ta" ".f")
 ("wilson" "ls" "i..o")
 ("harding" "r.i" "di" "a..i")
 ("coolidge" "oo" "li")
 ("hoover" "ho" "oo")
 ("truman" "u..n" "ma")
 ("eisenhower" "ho" ".se" "h..e" "i..n" "is.")
 ("kennedy" "nn" "n.e")
 ("johnson" "j")
 ("nixon" "^n" "i..n" "i.o" "n..o")
 ("carter" "rt" "a.t")
 ("reagan" "ga" "a.a")
 ("bush" "bu" "sh")
 ("obama" ".b" "ma" "a.a" "am"))
As you can see, we have reduced the original 2091 matching regexps to fifty.
Changes to the set-cover code coming soon....
07 Aug 2014 5:43pm GMT
06 Aug 2014
I'm arriving on Tuesday. Will anyone else be around before the lisp conference starts, and if so, are you interested in a little touristing / dining?
06 Aug 2014 6:35am GMT