14 Dec 2019
Planet Lisp
Zach Beane: Exclusive file open
The Planet Lisp twitter fix involved tracking post status with a file. Although it's not 100% bulletproof, there's a trick when using files to reduce races.
Here's a naive approach:
1. Assign a Planet Lisp post a unique pathname
2. If that pathname exists, do nothing
3. Otherwise, send a tweet for the post and save its details to the pathname
The problem is that between steps 2 and 3, another process could post the tweet first.
There's another option:
1. Assign a Planet Lisp post a unique pathname
2. Attempt to open the file with O_EXCL|O_CREAT mode
3. If the attempt succeeds, send a tweet for the post and save its details to the pathname
The key bit is the O_EXCL|O_CREAT mode, a Unix option that means open will fail if the file exists, and succeed atomically otherwise. Specifically:
The check for the existence of the file and the creation of the file if it does not exist shall be atomic with respect to other threads executing open() naming the same filename in the same directory with O_EXCL and O_CREAT set.
In SBCL, you can get O_EXCL|O_CREAT semantics by specifying :direction :output and one of NIL, :error, or :new-version as the :if-exists argument to CL:OPEN. With NIL, you can check whether the returned value is NIL, instead of a stream, to know if you have successfully opened the file exclusively and can proceed. I used the NIL approach in the Planet Lisp twitter gateway.
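Here's a minimal sketch of the NIL approach. The helpers post-marker-pathname, send-tweet, and save-post-details are hypothetical stand-ins for the gateway's actual code:

```lisp
(defun maybe-tweet-post (post)
  ;; :if-exists nil makes OPEN return NIL instead of signaling an error
  ;; when the file already exists; the existence check and the creation
  ;; are atomic, so only one process can win the race
  (let ((stream (open (post-marker-pathname post) ; hypothetical helper
                      :direction :output
                      :if-exists nil
                      :if-does-not-exist :create)))
    (when stream
      (unwind-protect
           (progn
             (send-tweet post)                ; hypothetical helper
             (save-post-details post stream)) ; hypothetical helper
        (close stream)))))
```

If OPEN returns NIL, another process has already claimed the post, and there is nothing left to do.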
14 Dec 2019 1:18pm GMT
Zach Beane: Planet Lisp is back on Twitter
I brought the @planet_lisp Twitter account back to life after more than a year of dormancy. It updates Twitter with every item on Planet Lisp.
It broke a while back because of how I used Twitter itself to keep track of what had been posted. I updated it to use a local data directory instead. It's working well so far.
14 Dec 2019 2:59am GMT
13 Dec 2019
Planet Lisp
Vsevolod Dyomkin: Programming Algorithms: Dynamic Programming
This chapter opens the final part of the book entitled "Selected Algorithms". In it, we're going to apply the knowledge from the previous chapters to the analysis of a selection of important problems that are mostly application-independent and find usages in many applied domains: optimization, synchronization, compression, and similar.
We will start with an approach that is arguably the most powerful algorithmic technique in use: if we manage to reduce a problem to Dynamic Programming (DP), in most cases, we can consider it solved. The fact that we have progressed so far in this book without mentioning DP is quite amazing. Actually, we could have already talked about it several times, especially in the previous chapter on strings, but I deliberately didn't start the exposition earlier, as I wanted to contain this topic to its own chapter. Indeed, strings are one of the domains where dynamic programming is used quite heavily, but the technique finds application in almost every area.
Also, DP is one of the first marketing terms in CS. When Bellman invented it, he wanted to use the then-hyped term "programming" to promote his idea. This has probably caused more confusion over the years than benefit. In fact, a good although unsexy name for this technique could be simply "filling the table", as the essence of the approach is an exhaustive evaluation of all variants with memoization of partial results (in a table) to avoid repetition of redundant computations. Obviously, it will have any benefit only when there are redundant computations, which is not the case, for example, with combinatorial optimization. To determine if a problem may be solved with DP, we need to validate that it has the optimal substructure property:
A problem has optimal substructure if an optimal solution to the whole problem contains optimal solutions to its subproblems.
An example of optimal substructure is the shortest path problem. If the shortest path from point A to point B passes through some point C, and there are multiple paths from C to B, the one included in the shortest path AB should be the shortest of them. In fact, the shortest path is an archetypal DP problem which we'll discuss later in this chapter. A counterexample is the Travelling Salesman Problem (TSP): if it had optimal substructure, the subpath between any two nodes in the resulting path would have to be the shortest possible path between these nodes. But this isn't true for all nodes, because it can't be guaranteed that the edges of the path will form a cycle with all the other shortest paths.
Fibonacci Numbers
So, as we said, the essence of DP is filling a table. This table, though, may have a different number of dimensions for different problems. Let's start with a 1d case. What book on algorithms can omit discussing the Fibonacci numbers? Usually, they are used to illustrate recursion, yet they are also a great showcase for the power of memoization. Besides, recursion is, conceptually, also an integral part of DP.
A naive approach to calculating the ith number is to code the Fibonacci formula directly:
(defun naive-fib (i)
  (assert (typep i '(integer 0)))
  (if (< i 2) 1
      (+ (naive-fib (- i 1))
         (naive-fib (- i 2)))))
However, applying it will result in exponential growth of the number of computations: each call to naive-fib results in two more calls. So, the number of calls needed for the nth number, with this approach, is O(2^n).
> (time (naive-fib 40))
Evaluation took: 3.390 seconds of real time
165580141
> (time (naive-fib 42))
Evaluation took: 7.827 seconds of real time
433494437
Yet, we can see here a direct manifestation of the optimal substructure property: the ith number calculation uses the result of the (i-1)th one. To utilize this recurrence, we'll need to store the previous results and reuse them. This may be achieved by changing the function call into a table access. Actually, from the point of view of math, tables and functions are, basically, the same thing.
(let ((fib (vec 1 1))) ; our table will be an adjustable vector
  (defun fib (i)
    (when (< (length fib) i)
      (vector-push-extend (fib (- i 1)) fib))
    (+ (? fib (- i 1))
       (? fib (- i 2)))))
What we've done here is add a layer of memoization to our function, using an array fib that is filled with the consecutive Fibonacci numbers. The array is hidden inside the closure of the fib procedure, so it will persist between the calls to it and accumulate the numbers as they are requested. There will also be no way to clear it, apart from redefining the function, as closed-over variables of this kind are not accessible outside of the function. The consecutive property is ensured by the arrangement of the recursive calls: the table is filled on the recursive ascent starting from the lowest yet-unknown number. This approach guarantees that each Fibonacci number is calculated exactly once and reduces our dreaded O(2^n) running time to a mere O(n)!
Such a calculation is the simplest example of top-down DP, which is performed using recursion. Despite its natural elegance, it suffers from a minor problem that may turn significant in some cases: extra space consumption by each recursive call. It's O(n) not only in time, but also in space. The alternative strategy that gets rid of the redundant space usage is called bottom-up DP and is based on loops instead of recursion. Switching to it is quite trivial, in this case:
(let ((fib (vec 1 1)))
  (defun bottom-up-fib (i)
    (let ((off (length fib)))
      (adjust-array fib (1+ i) :fill-pointer t)
      (dotimes (j (- (1+ i) off))
        (let ((j (+ j off)))
          (:= (aref fib j)
              (+ (aref fib (- j 1))
                 (aref fib (- j 2)))))))
    (aref fib i)))
> (time (bottom-up-fib 42))
Evaluation took: 0.000 seconds of real time
> (time (bottom-up-fib 4200))
Evaluation took: 0.004 seconds of real time
40512746637826407679504078155833145442086707013857032517543... ; this number is a Lisp bignum, which has unlimited size
Funny enough, a real-world-ready implementation of Fibonacci numbers ends up not using recursion at all...
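For completeness, here is one such non-recursive variant: a bottom-up calculation that keeps only the last two numbers instead of the whole table, reducing the space consumption from O(n) to O(1). It uses the same indexing convention as above, where fib(0) = fib(1) = 1:

```lisp
(defun iterative-fib (i)
  ;; keep just the two most recent numbers instead of the whole DP table
  (let ((prev 1)
        (cur 1))
    (dotimes (k (1- i) cur)
      ;; PSETF computes both new values before assigning, so CUR on the
      ;; right-hand side still refers to the old value
      (psetf prev cur
             cur (+ prev cur)))))

;; (iterative-fib 40) => 165580141, matching NAIVE-FIB above
```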
String Segmentation
Let's consider another 1d problem: suppose we have a dictionary of words and a string consisting of those words that has somehow lost the spaces between them, i.e. the words got glued together. We need to restore the original string with spaces or, to phrase it differently, split the string into words. This is one of the instances of the string segmentation problem, and if you're wondering how and where such a situation could occur for real, consider Chinese text, which doesn't have to contain spaces. Every Chinese language processing system needs to solve a similar task.
Here's an example input[1]:
String: thisisatest
Dictionary: a, i, at, is, hi, ate, his, sat, test, this
Expected output: this is a test
It is clear that even with such a small dictionary there are multiple ways we could segment the string. The straightforward and naive approach is to use a greedy algorithm. For instance, a shortest-first solution will try to find the shortest word from the dictionary starting at the current position and then split it off (as a prefix) from the string. It will result in the following split: this i sat est. But the last part est isn't in the dictionary, so the algorithm has failed to produce one of the possible correct splits (although, by chance, if the initial conditions were different, it could have succeeded). Another version, the longest-first approach, could look for the longest words instead of the shortest. This would result in: this is ate st. Once again, the final token is not a word. It is pretty obvious that these simple takes are not correct and we need a more nuanced solution.
As a common next step in developing such brute-force approaches, a developer would resort to backtracking: when the computation reaches a position in the string from which no word in the dictionary may be recovered, it unwinds to the position of the previous successful split and tries a different word. This procedure may have to go back multiple steps, possibly to the very beginning. As a result, in the worst case, to find a correct split, we may need to exhaustively try all possible combinations of words that fit into the string.
Here's an illustration of the recursive shortest-first greedy algorithm operation:
(defun shortest-first-restore-spaces (dict str)
  (dotimes (i (length str))
    (let ((word (slice str 0 (1+ i))))
      (when (? dict word)
        (return-from shortest-first-restore-spaces
          (cond-it
            ((= (1+ i) (length str))
             word)
            ((shortest-first-restore-spaces dict (slice str (1+ i)))
             (format nil "~A ~A" word it))))))))
CL-USER> (defparameter *dict* (hash-set 'equal "a" "i" "at" "is" "hi" "ate" "his" "sat" "test" "this"))
CL-USER> (shortest-first-restore-spaces *dict* "thisisatest")
0: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "thisisatest")
1: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "isatest")
2: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "satest")
3: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "est")
3: SHORTEST-FIRST-RESTORE-SPACES returned NIL
2: SHORTEST-FIRST-RESTORE-SPACES returned NIL
1: SHORTEST-FIRST-RESTORE-SPACES returned NIL
0: SHORTEST-FIRST-RESTORE-SPACES returned NIL
NIL
To add backtracking into the picture, we need to avoid returning after a failure of the recursive call:
(defun bt-shortest-first-restore-spaces (dict str)
  (dotimes (i (length str))
    (let ((word (slice str 0 (1+ i))))
      (when (in# word dict)
        (when (= (1+ i) (length str))
          (return-from bt-shortest-first-restore-spaces word))
        (when-it (bt-shortest-first-restore-spaces dict (slice str (1+ i)))
          (return-from bt-shortest-first-restore-spaces
            (format nil "~A ~A" word it)))))))
CL-USER> (bt-shortest-first-restore-spaces *dict* "thisisatest")
0: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "thisisatest")
1: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "isatest")
2: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "satest")
3: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "est")
3: BT-SHORTEST-FIRST-RESTORE-SPACES returned NIL
2: BT-SHORTEST-FIRST-RESTORE-SPACES returned NIL
;; backtracking kicks in here
2: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "atest")
3: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "test")
3: BT-SHORTEST-FIRST-RESTORE-SPACES returned "test"
2: BT-SHORTEST-FIRST-RESTORE-SPACES returned "a test"
1: BT-SHORTEST-FIRST-RESTORE-SPACES returned "is a test"
0: BT-SHORTEST-FIRST-RESTORE-SPACES returned "this is a test"
"this is a test"
The Lisp trace facility is an invaluable tool for understanding the behavior of recursive functions. Unfortunately, it doesn't work for loops, for which one has to resort to debug printing.
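For reference, the traces above are produced simply by registering the functions of interest with TRACE before calling them:

```lisp
(trace bt-shortest-first-restore-spaces)   ; log every call and return value
(bt-shortest-first-restore-spaces *dict* "thisisatest")
(untrace bt-shortest-first-restore-spaces) ; turn the logging off again
```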
Realizing that this is brute force, we could just as well use another approach: generate all combinations of dictionary words that add up to the total number of characters (n) and choose the ones that match the current string. The exact complexity of this scheme is O(2^n)[2]. In other words, our solution leads to a combinatorial explosion in the number of possible variants, a clear no-go for every algorithmic developer.
So, we need to come up with something different, and, as you might have guessed, DP fits in perfectly, as the problem has the optimal substructure: a complete word in a substring of the string remains a complete word in the whole string as well. Based on this understanding, let's reframe the task in a way that lends itself better to DP: find each character in the string that ends a complete word, so that all the words combined cover the whole string and do not intersect[3].
Here is an implementation of the DP-based procedure. Apart from calculating the maximum length of a word in the dictionary, which usually may be done offline, it requires a single forward and a single backward pass. The forward pass is a linear scan of the string that, at each character, tries to find all the words starting at it and matching the string. The complexity of this pass is O(n * w), where w is the constant length of the longest word in the dictionary, i.e. it is, actually, O(n). The backward pass (called, in the context of DP, decoding) restores the spaces using the so-called backpointers stored in the dp array. Below is a simplistic implementation that returns a single match. A recursive variant is possible, with or without a backward pass, that will accumulate all the possible variants.
(defun dp-restore-spaces (dict str)
  (let ((dp (make-array (1+ (length str)) :initial-element nil))
        ;; in the production implementation, the following calculation
        ;; should be performed at the preprocessing stage
        (w (reduce 'max (mapcar 'length (keys dict))))
        (begs (list))
        (rez (list)))
    ;; the outer loop tries to find the next word
    ;; only starting from the ends of the words that were found previously
    (do ((i 0 (pop begs)))
        ((or (null i)
             (= i (length str))))
      ;; the inner loop checks all substrings of length 1..w
      (do ((j (1+ i) (1+ j)))
          ((>= j (1+ (min (length str)
                          (+ w i)))))
        (when (? dict (slice str i j))
          (:= (? dp j) i)
          (push j begs)))
      (:= begs (reverse begs)))
    ;; the backward pass
    (do ((i (length str) (? dp i)))
        ((null (? dp i)))
      (push (slice str (? dp i) i) rez))
    (strjoin #\Space rez)))
CL-USER> (dp-restore-spaces *dict* "thisisatest")
"this is a test"
Similarly to the Fibonacci numbers, the solution to this problem doesn't use any additional information to choose between several variants of a split; it just takes the first one. However, if we wanted to find the variant that is most plausible to a human reader, we'd need to add some measure of plausibility. One idea might be to use a frequency dictionary, i.e. prefer the words that have a higher frequency of occurrence in the language. Such an approach, unfortunately, also has drawbacks: it overemphasizes short and frequent words, such as determiners, and doesn't account for how words are combined in context.
A more advanced option would be to use a frequency dictionary not just of words but of separate phrases (n-grams). The longer the phrases, the better from the standpoint of linguistics, but also the worse from the engineering point of view: more storage space is needed, and more data has to be processed if we want to collect reliable statistics for all the possible variants. And, once again, as the number of words in an n-gram grows, we will be facing the issue of combinatorial explosion pretty soon. The optimal point for this particular task might be bigrams or trigrams, i.e. phrases of 2 or 3 words. Using them, we'd have to supply another dictionary to our procedure and track the measure of plausibility of the current split as a product of the frequencies of the selected n-grams.
Formulated this way, our exercise becomes not merely an algorithmic task but an optimization problem. And DP is also suited to solving such problems. In fact, that was the primary purpose it was intended for, in the Operations Research community. We'll see it in action with our next problem: text justification. And developing a restore-spaces-plausibly procedure is left as an exercise to the reader. :)
Text Justification
The task of text justification is relevant to both editing and reading software: given a text consisting of paragraphs, split each paragraph into lines that contain whole words only, subject to a given line length limit, so that the variance of line lengths is the smallest. Its solution may be used, for example, to display text in HTML blocks with an align=justify property.
A more formal task description would be the following:

- the algorithm is given a text string and a line length limit (say, 80 characters)
- there's a plausibility formula that specifies the penalty for each line being shorter than the length limit. A usual formula is this:

(defun penalty (limit length)
  (if (<= length limit)
      (expt (- limit length) 3)
      most-positive-fixnum))

- the result should be a list of strings
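As a quick sanity check of the formula (the definition is repeated here so the snippet is self-contained): a line that is 12 characters short of a 50-character limit costs 12^3 = 1728, a perfectly full line costs nothing, and an overlong line is prohibitively expensive:

```lisp
(defun penalty (limit length)
  (if (<= length limit)
      (expt (- limit length) 3)  ; cubic penalty for the unused space
      most-positive-fixnum))     ; an overlong line never fits

(penalty 50 38) ; => 1728
(penalty 50 50) ; => 0
(penalty 50 60) ; => most-positive-fixnum
```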
As we are discussing this problem in the context of DP, first, we need to determine its optimal substructure. Superficially, we could claim that the lines in the optimal solution should be the lines with the smallest penalty, according to the formula. However, this doesn't work, as some of the potential lines with the best plausibility (length closest to 80 characters) may overlap, i.e. the optimal split may not be able to include all of them. What we can reliably claim is that, if the text is already justified from position 0 to i, we can still justify the remainder optimally regardless of how the prefix is split into lines. This is, basically, the same as with string segmentation, where we didn't care how the string was segmented before position i. And it's a common theme in DP problems: the key feature that allows us to save on redundant computation is that we only remember the optimal result of the computation that led to a particular partial solution, but we don't care what particular path was taken to obtain it (except when we need to restore the path, but that's what the backpointers are for, and it doesn't impact the forward pass of the algorithm). So the optimal substructure property of text justification is: if the best split of the whole string includes the consecutive indices x and y, then the best split from 0 to y should include x.
Let's justify the following text with a line limit of 50 chars:
Common Lisp is the modern, multi-paradigm, high-performance, compiled, ANSI-standardized,
most prominent descendant of the long-running family of Lisp programming languages.
Suppose we've already justified the first 104 characters. This leaves us with a suffix of length 69: descendant of the long-running family of Lisp programming languages.
As its length is above 50 chars but below 100, it requires exactly 1 split. This split may be performed after the first, second, third, etc. token. Let's calculate the total penalty of each candidate:
after "the": 5832 + 0 = 5832
after "long-running": 6859 + 2197 = 9056
after "family": 1728 + 8000 = 9728
after "of": 729 + 12167 = 12896
after "Lisp": 64 + 21952 = 22016
So, the optimal split starting at index 105[4] is into the strings: "descendant of the" and "long-running family of Lisp programming languages."
Now, we haven't guaranteed that index 105 will, in fact, be a split point in the optimal split of the whole string, but, if it were, we would have already known how to continue. This is the key idea of the DP-based justification algorithm: starting from the end, calculate the cost of justifying the remaining suffix after each token, using the results of the previous calculations. At first, while the suffix length is below the line limit, the costs are trivially computed by a single call to the penalty function. After exceeding the line limit, the calculation consists of two parts: the penalty for the current line + the previously calculated value for the rest of the suffix.
(defun justify (limit str)
  (with ((toks (reverse (split #\Space str)))
         (n (length toks))
         (penalties (make-array n))
         (backptrs (make-array n))
         (lengths (make-array n)))
    ;; forward pass (from the end of the string)
    (doindex (i tok toks)
      (let ((len (+ (length tok) (if (> i 0)
                                     (? lengths (1- i))
                                     0))))
        (:= (? lengths i) (1+ len))
        (if (<= len limit)
            (:= (? penalties i) (penalty limit len)
                (? backptrs i) -1)
            ;; minimization loop
            (let ((min most-positive-fixnum)
                  arg)
              (dotimes (j i)
                (with ((j (- i j 1))
                       (len (- (? lengths i)
                               (? lengths j)))
                       (penalty (+ (penalty limit len)
                                   (? penalties j))))
                  (when (> len limit) (return))
                  (when (< penalty min)
                    (:= min penalty
                        arg j))))
              (:= (? penalties i) min
                  (? backptrs i) arg)))))
    ;; backward pass (decoding)
    (loop :for end := (1- n) :then beg
          :for beg := (? backptrs end)
          :do (format t "~A~%" (strjoin #\Space (reverse (subseq toks (1+ beg) (1+ end)))))
          :until (= -1 beg))))
CL-USER> (justify 50 "Common Lisp is the modern, multi-paradigm, high-performance, compiled, ANSI-standardized,
most prominent descendant of the long-running family of Lisp programming languages.")
Common Lisp is the modern, multi-paradigm,
high-performance, compiled, ANSI-standardized,
most prominent descendant of the long-running
family of Lisp programming languages.
This function is somewhat longer, but, conceptually, it is pretty simple. The only insight I needed to implement it efficiently was the additional array for storing the lengths of all the string suffixes examined so far. This way, we apply memoization twice: to prevent recalculation of both the penalties and the suffix lengths, and all of the previously computed values are used at each iteration. If we were to store the suffixes themselves, we would have had to perform an additional O(n) length calculation at each iteration.
The algorithm performs two passes. In the forward pass (which is, in fact, performed from the end), it fills the slots of the DP arrays using the minimum joint penalty for the potential current line and the remaining suffix, the penalty for which was calculated during one of the previous iterations of the algorithm. In the backward pass, the resulting lines are extracted by traversing the backpointers starting from the last index.
The key difference from the previous DP example is these lines:
(:= (? penalties i) min
(? backptrs i) arg)
Adding them (alongside the whole minimization loop) turns DP into an optimization framework that, in this case, is used to minimize the penalty. The backptrs array, as we said, is used to restore the steps that led to the optimal solution, as, eventually (and this is true for the majority of DP optimization problems), we care about this sequence and not the optimization result itself.
As we can see, for the optimization problems, the optimal substructure property is manifested as a mathematical formula called the recurrence relation. It is the basis for the selection of a particular substructure among several variants that may be available for the current step of the algorithm. The relation involves an already memoized partial solution and the cost of the next part we consider adding to it. For text justification, the formula is the sum of the current penalty and the penalty of the newly split suffix. Each DP optimization task is based on a recurrence relation of a similar kind.
Now, let's look at this problem from a different perspective. We can represent our decision space as a directed acyclic graph. Its leftmost node (the "source") will be index 0, and it will have several direct descendants: nodes with those indices in the string at which we can potentially split it without exceeding the 50-character line limit or, alternatively, each substring that spans from index 0 to the end of some token and is not longer than 50 characters. Next, we'll connect each descendant node, in a similar manner, with all the nodes that are "reachable" from it, i.e. that have a higher value of the associated string position, with the difference between their index and this node's index being below 50. The final node of the graph (the "sink") will have the value of the length of the string. The cost of each edge is the value of the penalty function. Now, the task is to find the shortest path from source to sink.
Here is the DAG for the example string, with the nodes labeled with the indices of the potential string splits. As you can see, even for such a simple string, it's already quite big, not to speak of real texts. But it can provide some sense of the number of variants that an algorithm has to evaluate.
What is the complexity of this algorithm? On the surface, it may seem to be O(m^2), where m is the token count, as there are two loops: over all tokens and over the tail. However, the line (when (> len limit) (return)) limits the inner loop to only the part of the string that can fit into limit chars, effectively reducing it to a constant number of operations (not more than limit, but, in practice, an order of magnitude less). Thus, the actual complexity is O(m)[5].
Pathfinding Revisited
In fact, any DP problem may be reduced to pathfinding in a graph: the shortest path, if optimization is involved, or just any path otherwise. The nodes in this graph are the intermediate states (for instance, a split at index x or the ith Fibonacci number), and the edges are possible transitions that may bear an associated cost (as in text justification) or not (as in string segmentation). And the classic DP algorithm to solve the problem is the Bellman-Ford algorithm. Not incidentally, one of its authors, Bellman, is the "official" inventor of DP.
(defun bf-shortest-path (g)
  (with ((n (array-dimension g 0))
         (edges (edges-table g))
         (dists (make-array n :initial-element most-positive-fixnum))
         (backptrs (make-array n))
         (path (list)))
    (:= (? dists 0) 0)
    (dotimes (v n)
      (dolist (e (? edges v))
        (with ((u (src e))
               (dist (+ (dist e)
                        (? dists u))))
          (when (< dist (? dists v))
            (:= (? dists v) dist
                (? backptrs v) u)))))
    (loop :for v := (1- n) :then (? backptrs v)
          :do (push v path)
          :until (= v 0))
    (values path
            (? dists (1- n)))))
The code for the algorithm is very straightforward, provided that our graph representation already has the vertices and edges in a convenient data structure or implements such operations (in the worst case, the overall complexity should be not greater than O(V+E)). For the edges, we need a kv indexed by the edge destination, the opposite of the usual representation that groups them by their sources[6].
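Here's a sketch of how such an index could be built, assuming each edge is a struct with src, dst, and dist slots (the accessor names match the ones bf-shortest-path calls, but the representation itself, and building it from a plain list of edges, are assumptions):

```lisp
(defstruct (edge (:conc-name nil))
  src dst dist)

(defun edges-table (edges)
  ;; group the edges by their destination vertex: the resulting table
  ;; maps each vertex to the list of edges that end in it
  (let ((tab (make-hash-table)))
    (dolist (e edges tab)
      (push e (gethash (dst e) tab)))))
```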
Compared to text justification, this function looks simpler, as we don't have to perform task-specific processing that accounts for the character limit and the spaces between words. However, if we were to use bf-shortest-path, we'd first have to create the graph data structure from the original text, so all that complexity would go into the graph creation routine. Still, from the architectural point of view, such a split may be beneficial, as the pathfinding procedure could be reused for other problems.
One might ask a reasonable question: how does Bellman-Ford fare against Dijkstra's algorithm (DA)? As we have already learned, Dijkstra's is a greedy and optimal solution to pathfinding, so why consider yet another approach? Both algorithms operate by relaxation, in which approximations to the correct distance are replaced by better ones until the final result is reached. And in both of them, the approximate distance to each vertex is always an overestimate of the true distance, and it is replaced by the minimum of its old value and the length of a newly found path. It turns out that DA is also a DP-based approach, but with additional optimizations! It uses the same optimal substructure property and recurrence relations. The advantage of DA is the utilization of a priority queue to efficiently select the closest vertex that has not yet been processed. It then performs the relaxation process on all of that vertex's outgoing edges, while the Bellman-Ford algorithm relaxes all the edges. This method allows BF to calculate the shortest paths not to a single node but to all of them (which is also possible for DA but will make its runtime, basically, the same as for BF). So, Bellman-Ford complexity is O(V * E) compared to O(E + V logV) for the optimal implementation of DA. Besides, BF can account for negative edge weights, which would break DA.
So, DA remains the algorithm of choice for the standard shortest path problem, and it's worth keeping in mind that it can also be applied as a solver for some DP problems if they are decomposed into graph construction + pathfinding. However, some DP problems have additional constraints that make using DA for them pointless. For example, in text justification, the number of edges to consider at each step is limited by a constant factor, so the complexity of the exhaustive search is, in fact, O(V). Proving that for our implementation of justify is left as an exercise to the reader...
LCS & Diff
Let's return to strings and the application of DP to them. The ultimate DP-related string problem is string alignment. It manifests in many formulations. The basic one is the Longest Common Subsequence (LCS) task: determine the length of the common part of two input strings. Solving it, however, provides enough data to go beyond that: it enables determining the best alignment of the strings, as well as enumerating the edit operations needed to transform one string into another. The edit operations usually considered in the context of LCS are:

- insertion of a character
- deletion of a character
- substitution of a character
Based on the number of those operations, we can calculate a measure of the difference between two strings that is called the Levenshtein distance. It is one example of the so-called edit distances. Identical strings have a Levenshtein distance of 0, while the strings foobar and baz have a distance of 4 (3 deletion operations for the prefix foo and a substitution operation of r into z). There are also other variants of edit distance. For instance, the Damerau-Levenshtein distance, which is better suited to comparing texts with misspellings produced by humans, adds another modification operation: swap, which reduces the edit distance in the case of two adjacent characters being swapped to 1 instead of 2 for Levenshtein (1 deletion and 1 insertion).
The Levenshtein distance, basically, gives us the DP recurrence relations for free: when we consider the ith character of the first string and the jth one of the second, the edit distance between the prefixes 0..i and 0..j is either the same as for the pair of prefixes ending at (i-1) and (j-1) respectively, if the current characters are the same, or 1 + the minimum of the edit distances for the pairs of prefixes ending at i and (j-1), (i-1) and (j-1), and (i-1) and j.
We can encode this calculation as a function that uses a matrix for memoization. Basically, this is the DP solution to the edit distance problem: the bottom-right element of the matrix gives the measure of the difference between the whole strings.
(defun lev-dist (s1 s2 &optional
                         (i1 (1- (length s1)))
                         (i2 (1- (length s2)))
                         (ld (make-array (list (1+ (length s1))
                                               (1+ (length s2)))
                                         :initial-element nil)
                             ld-p)) ; a flag indicating that the argument was supplied
  ;; initialization of the 0th column and row
  (unless ld-p
    (dotimes (k (1+ (length s1))) (:= (aref ld k 0) k))
    (dotimes (k (1+ (length s2))) (:= (aref ld 0 k) k)))
  (values (or (aref ld (1+ i1) (1+ i2))
              (:= (aref ld (1+ i1) (1+ i2))
                  (if (eql (? s1 i1) (? s2 i2))
                      (lev-dist s1 s2 (1- i1) (1- i2) ld)
                      (1+ (min (lev-dist s1 s2 (1- i1) (1- i2) ld)
                               (lev-dist s1 s2 i1 (1- i2) ld)
                               (lev-dist s1 s2 (1- i1) i2 ld))))))
          ld))
However, if we want to also use this information to align the sequences, we'll have to make a reverse pass (here, a separate backpointers array isn't necessary, as we can infer the direction of each step by reversing the distance formula).
(defun align (s1 s2)
  (with ((i1 (length s1))
         (i2 (length s2))
         ;; our Levenshtein distance procedure returns the whole DP matrix
         ;; as a second value
         (ld (nth-value 1 (lev-dist s1 s2)))
         (rez (list)))
    (loop
     (let ((min (min (aref ld (1- i1) (1- i2))
                     (aref ld i1 (1- i2))
                     (aref ld (1- i1) i2))))
       (cond ((= min (aref ld (1- i1) (1- i2)))
              (push (pair (? s1 (1- i1)) (? s2 (1- i2)))
                    rez)
              (:- i1)
              (:- i2))
             ((= min (aref ld (1- i1) i2))
              (push (pair (? s1 (1- i1)) nil)
                    rez)
              (:- i1))
             ((= min (aref ld i1 (1- i2)))
              (push (pair nil (? s2 (1- i2)))
                    rez)
              (:- i2))))
     (when (= 0 i1)
       (loop :for j :from (1- i2) :downto 0 :do
         (push (pair nil (? s2 j)) rez))
       (return))
     (when (= 0 i2)
       (loop :for j :from (1- i1) :downto 0 :do
         (push (pair (? s1 j) nil) rez))
       (return)))
    ;; pretty output formatting
    (with-output-to-string (s1)
      (with-output-to-string (s2)
        (with-output-to-string (s3)
          (loop :for (c1 c2) :in rez :do
            (format s1 "~C " (or c1 #\.))
            (format s2 "~C " (cond ((null c1) #\↓)
                                   ((null c2) #\↑)
                                   ((char= c1 c2) #\|)
                                   (t #\x)))
            (format s3 "~C " (or c2 #\.)))
          (format t "~A~%~A~%~A~%"
                  (get-output-stream-string s1)
                  (get-output-stream-string s2)
                  (get-output-stream-string s3)))))
    rez))
CL-USER> (align "democracy" "remorse")
d e m o c r a c y
x | | | ↑ | ↑ x x
r e m o . r . s e
CL-USER> (lev-dist "democracy" "remorse")
5
#2A((0 1 2 3 4 5 6 7)
(1 1 2 3 4 5 6 7)
(2 2 1 2 3 4 5 6)
(3 3 2 1 2 3 4 5)
(4 4 3 2 1 2 3 4)
(5 5 4 3 2 2 3 4)
(6 5 5 4 3 2 3 4)
(7 6 6 5 4 3 3 4)
(8 7 7 6 5 4 4 4)
(9 8 8 7 6 5 5 5))
It should be pretty clear how we can also extract the edit operations during the backward pass: depending on the direction of the movement (horizontal, vertical, or diagonal), it's either an insertion, a deletion, or a substitution. The same operations may also be grouped to reduce noise. The alignment task is an example of a 2d DP problem. Hence, the diff computation has a complexity of O(n^2). There are other notable algorithms, such as CYK parsing or the Viterbi algorithm, that also use a 2d array, although they may have higher complexity than O(n^2). For instance, CYK parsing is O(n^3), which is very slow compared to the greedy O(n) shift-reduce algorithm.
However, the diff we obtain from the basic LCS computation will still be pretty basic. There are many small improvements made by production diff implementations, both on the UX and the performance sides. Besides, the complexity of the algorithm is O(n^2), which is quite high, so many practical variants perform additional optimizations to reduce the actual number of operations, at least for the common cases.
The simplest improvement is a preprocessing step that is warranted by the fact that, in many applications, the diff is performed on texts that are usually mostly identical and have a small number of differences between them localized in an even smaller number of places. For instance, consider source code management, where diff plays an essential role: the programmers don't tend to rewrite whole files too often, on the contrary, such practice is discouraged due to programmer collaboration considerations.
So, some heuristics may be used in the library diff implementations to speed up such common cases:
 check that the texts are identical
 identify common prefix/suffix and perform the diff only on the remaining part
 detect situations when there's just a single or two edits
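The first two heuristics can be sketched as follows (a throwaway illustration with hypothetical helper names, not taken from any particular diff library): trim the common prefix and suffix, then run the expensive quadratic diff only on the differing middles.

```python
def trim_common_affixes(a, b):
    """Identify the common prefix and suffix of a and b, so that the
    O(n^2) diff core only has to process the differing middle parts."""
    # common prefix
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    # common suffix, not overlapping the prefix
    j = 0
    while j < min(len(a), len(b)) - i and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    # returns (prefix, middle of a, middle of b, suffix)
    return a[:i], a[i:len(a) - j], b[i:len(b) - j], a[len(a) - j:]
```

For identical inputs the middles come back empty, so the quadratic core can be skipped entirely; for typical source-code edits the middles are a small fraction of the file.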
A perfect diff algorithm will report the minimum number of edits required to convert one text into the other. However, sometimes the result is too perfect and not very good for human consumption. People expect edit operations to be separated at token boundaries when possible; also, larger contiguous parts are preferred to an alternation of small changes. All these and other diff ergonomics issues may be addressed by various postprocessing tweaks.
But, besides these simple tricks, are global optimizations to the algorithm possible? After all, O(n^2) space and time requirements are still pretty significant. Originally, diff was developed for Unix by Hunt and McIlroy. Their approach computes matches in the whole file and indexes them into the so-called k-candidates, k being the LCS length. The LCS is augmented progressively by finding matches that fall within proper ordinates (following a rule explained in their paper). While doing this, each path is memoized. The problem with the approach is that it performs more computation than necessary: it memoizes all the paths, which requires O(n^2) memory in the worst case, and O(n^2 log n) time!
The current standard approach is the divide-and-conquer Myers algorithm. It works by recursively finding the central match of two sequences with the smallest edit script. Once this is done, only the match is memoized, and the two subsequences preceding and following it are compared again recursively by the same procedure, until there is nothing more to compare. Finding the central match is done by matching the ends of subsequences as far as possible, and any time it is not possible, augmenting the edit script by 1 operation, scanning each furthest position attained up to there for each diagonal, and checking how far the match can expand. If two matches merge, the algorithm has just found the central match. This approach has the advantage of using only O(n) memory and executes in O(n d), where d is the edit script complexity (d is less than n, usually much less). The Myers algorithm wins because it does not memoize the paths while working and does not need to "foresee" where to go, so it can concentrate only on the furthest positions it can reach with an edit script of the smallest complexity. The smallest-complexity constraint ensures that what is found is, indeed, the LCS. In a sense, the Myers algorithm, compared to the vanilla DP diff, is like Dijkstra's algorithm versus Bellman-Ford: it cuts down on the calculation of the edit distances between the substrings that don't contribute to the optimal alignment, whereas solving LCS by building the whole edit-distance matrix performs the computation for all substrings.
The diff tool is a prominent example of a transition from quite an abstract algorithm to a practical utility that is an essential part of many ubiquitous software products, and of the additional work needed to ensure that the final result is not only theoretically sane but also usable.
P.S. Ever wondered how GitHub and other tools, when displaying a diff, not only show the changed line but also highlight the exact changes in the line? The answer is given in [7].
DP in Action: Backprop
As we said in the beginning, DP has applications in many areas: from machine learning to graphics to source code management. Literally, you can find an algorithm that uses DP in every specialized domain, and if you don't, this means you can probably still advance that domain and create something useful by applying DP to it. Deep learning has been the fastest-developing area of the machine learning domain in recent years. At its core, the discipline is about training huge multilayer optimization functions called "neural networks". And the principal approach to doing that, which, practically speaking, has enabled the rapid development of the machine learning techniques we see today, is the backpropagation (backprop) optimization algorithm.
As pointed out by Christopher Olah, for modern neural networks, it can make training with gradient descent as much as ten million times faster relative to a naive implementation. That's the difference between a model taking a week to train and taking 200,000 years. Beyond its use in deep learning, backprop is a computational tool that may be applied in many other areas, ranging from weather forecasting to analyzing numerical stability; it just goes by different names there. In fact, the algorithm has been reinvented dozens of times in different fields. The general, application-independent name for it is reverse-mode differentiation. Essentially, it's a technique for calculating partial derivatives quickly using DP on computational graphs.
Computational graphs are a nice way to think about mathematical expressions. For example, consider the expression (:= e (* (+ a b) (1+ b))). There are four operations: two additions, one multiplication, and an assignment. Let's arrange those computations in the same way they would be performed on the computer:
(let ((c (+ a b))
(d (1+ b)))
(:= e (* c d)))
To create a computational graph, we make each of these operations, along with the input variables, into nodes. When the outcome of one expression is an input to another one, a link points from one node to another:
We can evaluate the expression by setting the values in the input nodes (a and b) to certain values and computing the nodes in the graph along the dependency paths. For example, let's set a to 2 and b to 1: the result in node e will be, obviously, 6.
The derivatives in a computational graph can be thought of as edge labels. If a directly affects c, then we can write a partial derivative ∂c/∂a along the edge from a to c.
Here is the computational graph with all the derivatives for the evaluation with the values of a and b set to 2 and 1.
But what if we want to understand how nodes that aren't directly connected affect each other? Let's consider how e is affected by a. If we change a at a speed of 1, c also changes at a speed of 1. In turn, c changing at a speed of 1 causes e to change at a speed of 2. So e changes at a rate of (* 1 2) with respect to a. The general rule is to sum over all possible paths from one node to the other, multiplying the derivatives on each edge of the path together. We can see that this graph is, basically, the same as the graph we used to calculate the shortest path.
This is where forward-mode differentiation and reverse-mode differentiation come in. They're algorithms for efficiently computing the sum by factoring the paths. Instead of summing over all of the paths explicitly, they compute the same sum more efficiently by merging paths back together at every node. In fact, both algorithms touch each edge exactly once. Forward-mode differentiation starts at an input to the graph and moves towards the end. At every node, it sums all the paths feeding in. Each of those paths represents one way in which the input affects that node. By adding them up, we get the total derivative. Reverse-mode differentiation, on the other hand, starts at an output of the graph and moves towards the beginning. At each node, it merges all paths which originated at that node. Forward-mode differentiation tracks how one input affects every node. Reverse-mode differentiation tracks how every node affects one output.
So, what if we do reverse-mode differentiation from e down? This gives us the derivative of e with respect to every node. Forward-mode differentiation gave us the derivative of our output with respect to a single input, but reverse-mode differentiation gives us all of the derivatives we need in one go. When training neural networks, the cost is a function of the weights of each edge. And using reverse-mode differentiation (aka backprop), we can calculate the derivatives of the cost with respect to all the weights in a single pass through the graph, and then feed them into gradient descent. As there are millions and tens of millions of weights in a neural network, reverse-mode differentiation results in a speedup of the same factor!
Backprop is an example of simple memoization DP. No selection of the best variant is needed, it's just a proper arrangement of the operations to avoid redundant computations.
Takeaways
DPbased algorithms may operate on one of these three levels:
 just systematic memoization, when every intermediate result is cached and used to compute subsequent results for larger problems (Fibonacci numbers, backprop)
 memoization + backpointers that allow for the reconstruction of the sequence of actions that lead to the final solution (text segmentation)
 memoization + backpointers + a target function that selects the best intermediate solution (text justification, diff, shortest path)
If we want to apply DP to some task, we need to find its optimal substructure: i.e., verify that an optimal solution to a subproblem will remain a part of the optimal solution to the whole problem. Next, if we deal with an optimization task, we may have to formulate the recurrence relations. After that, it's just a matter of technique: those relations may be either programmed directly as a recursive or iterative procedure (like in LCS) or indirectly, using the method of consecutive approximations (like in Bellman-Ford).
Ultimately, all DP problems may be reduced to pathfinding in the graph, but it doesn't always make sense to have this graph explicitly as a data structure in the program. If it does, however, remember that Dijkstra's algorithm is the optimal algorithm to find a single shortest path in it.
DP, usually, is a reasonable next thing to think about after the naive greedy approach (which, let's be frank, everyone tends to take initially) stumbles over backtracking. However, we saw that DP and greedy approaches do not contradict each other: in fact, they can be combined, as demonstrated by Dijkstra's algorithm. Yet, an optimal greedy algorithm is more of an exception than a rule, although there is a number of problems for which a top-n greedy solution (the so-called beam search) can be a near-optimal solution that is good enough.
Also, DP doesn't necessarily mean optimal. A vanilla dynamic programming algorithm exhaustively explores the decision space, which may be excessive in many cases. This is demonstrated by the examples of Dijkstra's and Myers' algorithms, which improve on the DP solution by cutting some corners.
P.S. We have also discussed, for the first time in this book, the value of heuristic pre- and postprocessing. From the theoretical standpoint, it is not something you have to pay attention to, but, in practice, it's a very important aspect of the production implementation of many algorithms and, thus, shouldn't be frowned upon or neglected. In an ideal world, an algorithmic procedure should both have optimal worst-case complexity and the fastest operation in the common cases.
[1] If you wonder, s is a word that is usually present in English programmatic dictionaries: when it's and friends are tokenized, they're split into two tokens, and the apostrophe may sometimes be missing. Also, our dictionary is case-insensitive.
[2] The intuition for it is the following: in the worst case, every character has two choices: either to be the last letter of the previous word or the first one of the next word, hence the branching factor is 2.
[3] Actually, the condition of complete string coverage may be lifted, which will allow using almost the same algorithm but skipping over out-of-dictionary words like misspellings.
[4] A space at the end of the line is discarded.
[5] Provided all the length calculations are implemented efficiently. For simplicity, I have used plain lists here, with a linear-time length operation, but a separate variable may be added to avoid the extra cost.
[6] However, if we think of it, we could reuse the already proven linked representation just putting the incoming edges into the node structure instead of the outgoing ones.
[7] It runs diff twice: first, at the linelevel (using each line as a single unit/token) and then at the character level, as you would normally expect. Then the results are just combined.
13 Dec 2019 10:40am GMT
12 Dec 2019
Planet Lisp
Joe Marshall: I think you left something out
I recently had the pleasure of working with some auto-generated Java code. All the accessors and mutators were auto-generated, and so was the predicate. The only thing missing was the constructor. You could manipulate one of these objects in any way you desired, but there was no easy way to instantiate one in the first place. I eventually figured out a way to instantiate one through reflection: I created a JSON version of the object and deserialized it. Crude, but effective.
I ran into a similar problem where I needed an instance of an object, but some idiot decided to make all the constructors protected
. In this case, I created an unwanted subclass with a public
constructor, called that instead, and then upcast to the object I really wanted.
Of course I was using these objects in a way the author of the code did not intend they be used, but the author wasn't omniscient and didn't foresee my use case (which was testing). By trying to prevent unintended uses, he simply forced me to write nasty code to get around his ultimately ineffective roadblocks. The author could have spent his time more productively creating a library with a more flexible API to handle unforeseen uses rather than inventing useless impediments to getting the job done.
I still cannot seem to reply to comments. I type a nice reply, hit the "publish" button, and the reply disappears, never to be seen again.
I'm working at a company that requires 100% branch coverage in tests. This requires teasing white-box libraries into returning malformed replies to cover all the branches in the error handler, even the ones that cannot actually be taken because the library doesn't actually generate malformed replies. So I have to supply the malformed replies myself, and it isn't easy if there aren't data constructors. We don't have AspectJ, and the source code isn't always available. And even when it is, it's simply not a good idea to modify it so that it is capable of generating errors it wouldn't otherwise be able to. This leaves me to resort to hacks.
I was wrong. I just stumbled across the first example. It turns out that the constructor was generated, it just had no arguments. Now this is not unusual in Java (although I consider it bad form: you should be able to construct and initialize an object in one step rather than have to construct a skeleton object and smash in the field values. This reduces your chances of creating partially initialized objects.) What made this a problem is that the generated code had no mutators, so you could only create skeleton objects, you couldn't put any values in them. I did remember correctly that I deserialized a JSON version to create an object with actual values in it. This seemed a better option than directly using reflection because the JSON object "looked" like the resulting object I wanted.
I'm no fan of just making every field mutable by default  just the opposite, in fact. Objects should usually be immutable. But you ought to have a way to create an initialized object without resorting to reflection.
12 Dec 2019 5:36pm GMT
10 Dec 2019
Planet Lisp
Joe Marshall: (map append elements)
I can recall the difficulty I had understanding let expressions in Lisp. I used to write out the equivalent lambda expression and then manually add the syntactic sugar to make it into a let. And I don't recall having an easy time with the chapter on linked lists. So I find it amusing to see what tortuous code students come up with when they are first introduced to linked lists. My favorite so far has been (map append elements), which looks like it might join the sublists in elements, but if you think about it, the arity is all wrong: map doesn't apply its function, it calls it. This had me scratching my head for several minutes until I realized you can call apply on a single argument.
So what little gems of code have you discovered that perform the most trivial of tasks in the most convoluted or obscure manner?
10 Dec 2019 4:56pm GMT
06 Dec 2019
Planet Lisp
Joe Marshall: "No 'x' in 'Nixon'"
I just had a phone screen where they asked if I could write a palindrome detector. I wrote a recursive and an iterative version. I like the recursive solution because it is so simple. There are only three equations:
 An empty string is a palindrome.
 A string with one character is a palindrome.
 If a string starts and ends with the same character, and the substring between them is a palindrome, then the entire string is a palindrome.
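Those three equations translate almost verbatim into code; a sketch in Python (the post doesn't specify the language used in the phone screen, and the function name is introduced here):

```python
def palindromep(s):
    """A string is a palindrome iff it is empty, a single character,
    or its end characters match and the middle is itself a palindrome."""
    if len(s) <= 1:                                # equations 1 and 2
        return True
    return s[0] == s[-1] and palindromep(s[1:-1])  # equation 3
```

Note that the check is case-sensitive as written, so "Nixon" is rejected on the very first comparison, matching the post's title.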
Apparently I cannot post comments on my own blog. Well, I can edit the post. "if s[pos] != s[-pos-1]" sure looks like two pointers, but to be fair, the recursive solution hides its pointers in the string structure, or in the slices if it is using them. As for "(string= x (string-reverse x))", there is something elegant in something that crude.
06 Dec 2019 9:44pm GMT
05 Dec 2019
Planet Lisp
Zach Beane: Working with letsencrypt's certbot for a Lisp webserver
Every 90 days my letsencrypt certificate expires and I renew it manually. I have to cut and paste data from the certbot script into repl forms to serve the right data on the right path and it's such a pain that sometimes I put it off until after it expires, and people email me about it, and I feel bad.
The manual process looks something like this (not my actual domain or challenges):
# certbot certonly --manual -d my.example.com
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
http-01 challenge for my.example.com
                                       
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.
Are you OK with your IP being logged?
                                       
(Y)es/(N)o: y
                                       
Create a file containing just this data:
uaut.gj61B3f7oYQcWZSF4kxS4OFh8KQlsDVtPXrw60Tkj2JLW7RtZaVE0MIWwiEKRlxph7SaLwp1ETFjaGDUKN
And make it available on your web server at this URL:
http://my.example.com/.well-known/acme-challenge/SpZa7sf4QMEFoF7lKh7aT4EZjNWSVcur2jODPQkgExa
                                       
Press Enter to Continue
There's no way that letsencrypt's certbot program will automatically get support for my custom hunchentoot "framework" and I don't really want to look into adding it myself.
For a while I thought about writing some Expect functionality in SBCL; sb-ext:run-program already supports a :pty option, so it wouldn't be super-difficult. With a theoretical cl-expect you could spawn certbot with the right options, slurp out the verification secret and URL via the process's standard output stream, and call whatever code you need to serve the secret at that location.
I realized today there's a halfway solution that takes the worst cutting and pasting out of the loop. The Unix command script, with its flush option, starts an interactive session where all input and output is saved to a file. My Lisp webserver can then read the certbot program output from that file and configure the webserver automagically for me. I still have to manually start certbot and enter a few REPL commands, but it's easier than before.
Here's the core of the Lisp side of things for processing the script file:
(defun scrape-lets-encrypt-interaction (file)
  (let (secret path)
    (with-open-file (stream file)
      (labels (...)
        (loop
          (skip-past "Create a file containing")
          (next-line)
          (setf secret (read-trimmed-line))
          (skip-past "make it available")
          (next-line)
          (setf path (substring-after ".well-known" (read-trimmed-line))))))))
This could be done just as well, perhaps, with cl-ppcre, but I didn't want to pull it in as a dependency.
Here are the functions from labels:
((next-line ()
   (let ((line (read-line stream nil)))
     (when (null line)
       (unless (and secret path)
         (error "Didn't find a secret and path anywhere in ~S"
                file))
       (return-from scrape-lets-encrypt-interaction
         (values secret path)))
     line))
 (skip-past (string)
   (loop
     (let ((line (next-line)))
       (when (search string line)
         (return)))))
 (read-trimmed-line ()
   (string-trim '(#\Return)
                (next-line)))
 (substring-after (string target)
   (let ((pos (search string target)))
     (unless pos
       (error "Could not find ~S in ~S" string target))
     (subseq target pos))))
The goal here is to only look at the last secret and URL in the script output, so the main loop keeps track of what it's seen so far, and next-line returns those values at EOF. The output also has ugly trailing ^M noise, so read-trimmed-line takes care of that.
Here's the whole thing all together:
(defun scrape-lets-encrypt-interaction (file)
  (let (secret path)
    (with-open-file (stream file)
      (labels ((next-line ()
                 (let ((line (read-line stream nil)))
                   (when (null line)
                     (unless (and secret path)
                       (error "Didn't find a secret and path anywhere in ~S"
                              file))
                     (return-from scrape-lets-encrypt-interaction
                       (values secret path)))
                   line))
               (skip-past (string)
                 (loop
                   (let ((line (next-line)))
                     (when (search string line)
                       (return)))))
               (read-trimmed-line ()
                 (string-trim '(#\Return)
                              (next-line)))
               (substring-after (string target)
                 (let ((pos (search string target)))
                   (unless pos
                     (error "Could not find ~S in ~S" string target))
                   (subseq target pos))))
        (loop
          (skip-past "Create a file containing")
          (next-line)
          (setf secret (read-trimmed-line))
          (skip-past "make it available")
          (next-line)
          (setf path (substring-after ".well-known" (read-trimmed-line))))))))
I don't mind shaving only half a yak when it feels like useful progress!
Someday I'll get around to a proper Expect-like thing…
05 Dec 2019 11:35pm GMT
02 Dec 2019
Planet Lisp
Wimpie Nortje: HTTP Routing libraries for Hunchentoot
Hunchentoot is a web server and not a heavyweight web development framework. It provides methods to implement URL paths for static files, directory paths, and with regexes, very much like the mainstream web servers. It does not provide an easy way to define routes for a REST API. When you use Hunchentoot without a web development framework, you will probably want something to make route definition easier.
There are a few options for building REST APIs available in frameworks, Hunchentoot derivatives, or other web servers, but I wanted to implement REST routes with the original Hunchentoot web server. I found three libraries that can do this: simple-routes, Froute, and easy-routes.
Simple-routes is the simplest and easiest to use. Routes are defined in a list similar to Hunchentoot's *dispatch-table*. It supports variables in the URL segments, but there is no support for middleware^{1}-type functionality.
Froute is the most powerful of the three. It is based on CLOS and is designed so it can be used with any web server, although only a Hunchentoot connector is currently implemented. Routes are defined as CLOS classes, and even though middleware is not a specific feature, the documentation gives an example of how to use class inheritance to implement such functionality. The power of being CLOS-based also makes this library the most complex to use.
Easy-routes has the concept of decorators, which are functions that execute before the route body, so it can be used to implement middleware functionality. Unlike Clack's middleware, which is defined at a central place for all routes, decorators need to be applied to each route handler individually. It's not quite there, but close enough.
The lack of middleware options disqualified simple-routes for me, and Froute looked like it provides everything I need, and more, but with much greater complexity than easy-routes. I decided to use easy-routes, with the option to switch to Froute when I need the extra capability.
Hunchentoot takes an "acceptor" argument at startup. Easy-routes provides two options: easy-routes-acceptor and routes-acceptor. Easy-routes-acceptor first executes all the route handlers, and if no suitable handler is found, it executes the normal Hunchentoot request handlers. The routes-acceptor executes only the route handlers and returns a 404 NOT FOUND error if no suitable handler is found.
I use routes-acceptor because it ensures that only requests with explicitly defined handlers are handled. With the easy-routes-acceptor it is too easy to create a security hole with some default Hunchentoot request handler that catches nonexistent routes. It can be burdensome to use this approach for handling static files, but I run Hunchentoot behind Nginx, which also handles the static files.
The table summarises the main features I investigated:
|                                | simple-routes                        | easy-routes                                  | Froute                                  |
|--------------------------------|--------------------------------------|----------------------------------------------|-----------------------------------------|
| Web server                     | Hunchentoot                          | Hunchentoot                                  | Hunchentoot (can be expanded to others) |
| REST routes                    | Yes                                  | Yes                                          | Yes                                     |
| Argument extraction from URL   | Yes                                  | Yes                                          | Yes                                     |
| Dispatch based on HTTP method  | Yes                                  | Yes                                          | Yes                                     |
| Middleware                     | No                                   | Decorators                                   | CLOS inheritance                        |
| Fallback for undefined routes  | Hunchentoot easy-handler framework   | None, or Hunchentoot easy-handler framework  | None                                    |
| Learning curve                 | Negligible                           | Minimal                                      | Medium. Requires some CLOS knowledge.   |

Middleware are functions that run before or after the main request handler. They are called as part of the request-handling process and not by the handler. This makes them ideal for general functions like setting up a database connection, performing authorisation, or any task which is not part of a particular request handler. ↩
02 Dec 2019 12:00am GMT
01 Dec 2019
Planet Lisp
Paul Khuong: A Multiset of Observations With Constant-time Sample Mean and Variance
Fixed notation issues in the "Faster multiset updates" section. Thank you Joonas.
Let's say you have a multiset (bag) of "reals" (floats or rationals), where each value is a sampled observation. It's easy to augment any implementation of the multiset ADT to also return the sample mean of the values in the multiset in constant time: track the sum of the values in the multiset as they are individually added and removed. This requires one accumulator and a counter for the number of observations in the multiset (i.e., constant space) and adds a constant-time overhead to each update.
It's not as simple when you also need the sample variance of the multiset \(X\), i.e.,
\[\frac{1}{n - 1} \sum\sb{x \in X} (x - \hat{x})\sp{2},\]
where \(n = |X|\) is the sample size and \(\hat{x}\) is the sample mean \(\sum\sb{x\in X} x/n,\) ideally with constant query time, and constant time and space overhead for each update.
One could try to apply the textbook equality
\[s\sp{2} = \frac{1}{n(n-1)}\left[n\sum\sb{x\in X} x\sp{2} - \left(\sum\sb{x\in X} x\right)\sp{2}\right].\]
However, as Knuth notes in TAoCP volume 2, this expression loses a lot of precision to roundoff in floating point: in extreme cases, the difference might be negative (and we know the variance is never negative). More commonly, we'll lose precision when the sampled values are clustered around a large mean. For example, the sample standard deviation of 1e8 and 1e8 - 1 is the same as that of 0 and 1 (about 0.707). However, the expression above would evaluate it to 0.0, even in double precision: while 1e8 is comfortably within range for double floats, its square 1e16 is outside the range where all integers are represented exactly.
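To see that failure mode concretely (a throwaway illustration; the function name is introduced here, not taken from the post):

```python
def naive_variance(xs):
    """Textbook one-pass formula; cancels catastrophically when the
    mean is large relative to the spread of the observations."""
    n = len(xs)
    s = sum(xs)
    s2 = sum(x * x for x in xs)
    return (n * s2 - s * s) / (n * (n - 1))

# naive_variance([1e8, 1e8 - 1]) evaluates to 0.0 in double precision,
# even though the true sample variance is 0.5.
```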
Knuth refers to a better-behaved recurrence by Welford, where a running sample mean is subtracted from each new observation before squaring. John Cook has a C++ implementation of the recurrence that adds observations to a sample variance in constant time. In Python, this streaming algorithm looks like this.
streaming_variance.py

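The listing itself didn't survive in this copy; a minimal sketch of such a class, using Welford's recurrence and the attribute names referenced below (mean, var_sum), might look like this (a reconstruction, not John Cook's exact code):

```python
class StreamingVariance:
    """Welford's online algorithm: constant-time updates of the running
    mean and the sum of squared deviations from it (var_sum)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.var_sum = 0.0   # sum of (x - mean)^2 over the observations

    def observe(self, v):
        self.n += 1
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def get_mean(self):
        return self.mean

    def get_variance(self):
        return self.var_sum / (self.n - 1) if self.n > 1 else 0.0
```

Unlike the textbook formula, this handles the clustered-around-a-large-mean case gracefully: observing 1e8 and 1e8 - 1 yields a variance of 0.5 rather than 0.0.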
That's all we need for insertonly multisets, but does not handle removals; if only we had removals, we could always implement updates (replacement) as a removal and an insertion.
Luckily, StreamingVariance.observe looks invertible. It shouldn't be hard to recover the previous sample mean, given v, and, given the current and previous sample means, we can re-evaluate (v - old_mean) * (v - self.mean) and subtract it from self.var_sum.
Let \(\hat{x}\sp{\prime}\) be the sample mean after observe(v). We can derive the previous sample mean \(\hat{x}\) from \(v\):
\[(n - 1)\hat{x} = n\hat{x}\sp{\prime} - v \Leftrightarrow \hat{x} = \hat{x}\sp{\prime} + \frac{\hat{x}\sp{\prime} - v}{n-1}.\]
This invertibility means that we can undo calls to observe in LIFO order. We can't handle arbitrary multiset updates, only a stack of observations. That's still better than nothing.
variance_stack.py
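The original listing is lost; the following sketch shows how the inverse update could look. The push/pop/get_mean/get_variance names follow the driver code's references, and pop applies the mean-recovery formula derived above (the exact original may differ):

```python
class VarianceStack:
    """Streaming variance whose observations can be popped in LIFO order."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.var_sum = 0.0

    def push(self, v):
        # Welford's forward update.
        self.n += 1
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def pop(self, v):
        # Invert push(v): recover the previous mean, then back out the
        # term that push added to var_sum.
        assert self.n > 0
        if self.n == 1:
            self.n, self.mean, self.var_sum = 0, 0.0, 0.0
            return
        old_mean = self.mean + (self.mean - v) / (self.n - 1)
        self.var_sum -= (v - old_mean) * (v - self.mean)
        self.mean = old_mean
        self.n -= 1

    def get_mean(self):
        return self.mean

    def get_variance(self):
        return self.var_sum / (self.n - 1) if self.n > 1 else 0.0
```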
Before going any further, let's test this.
Testing the VarianceStack
The best way to test the VarianceStack is to execute a series of push and pop calls, and compare the results of get_mean and get_variance with batch reference implementations.
I could hardcode calls in unit tests. However, that quickly hits diminishing returns in terms of marginal coverage versus developer time. Instead, I'll be lazy, skip unit tests completely, and rely on Hypothesis, in particular its high-level "stateful" testing API.
We'll keep track of the values pushed and popped off the observation stack in the driver: we must make sure they're matched in LIFO order, and we need the stack's contents to compute the reference mean and variance. We'll also want to compare the results with reference implementations, modulo some numerical noise. Let's try to be aggressive and bound the number of representable float values (ULPs) between the reference and the actual results.
variance_stack_driver.py
This initial driver does not even use the VarianceStack yet. All it does is push values to the reference stack, pop values when the stack has something to pop, and check that the reference implementations match themselves after each call: I want to first shake out any bugs in the test harness itself.
Not surprisingly, Hypothesis does find an issue in the reference implementation:
Falsifying example:
state = VarianceStackDriver()
state.push(v=0.0)
state.push(v=2.6815615859885194e+154)
state.teardown()
We get a numerical OverflowError in reference_variance: 2.68...e154 / 2 is slightly greater than sqrt(sys.float_info.max) = 1.3407807929942596e+154, so taking the square of that value errors out instead of returning infinity.
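We can reproduce this failure mode directly. The snippet below is my own illustration, and it assumes the reference squared with the ** operator; plain multiplication would instead have overflowed silently to infinity:

```python
import sys

x = 2.6815615859885194e+154  # the value from the falsifying example
assert x > sys.float_info.max ** 0.5  # its square exceeds the double range

# IEEE-754 multiplication overflows silently to infinity...
assert x * x == float("inf")

# ...but CPython's float ** detects the overflow and raises instead.
try:
    x ** 2
    raised = False
except OverflowError:
    raised = True
assert raised
```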
Let's start by clamping the range of the generated floats.
variance_stack_driver.py
Now that the test harness doesn't find fault in itself, let's hook in the VarianceStack, and see what happens when only push calls are generated (i.e., first test only the standard streaming variance algorithm).
variance_stack_driver.py
This already fails horribly.
Falsifying example:
state = VarianceStackDriver()
state.push(v=1.0)
state.push(v=1.488565707357403e+138)
state.teardown()
F
The reference finds a variance of 5.54e275, which is very much not the streaming computation's 1.108e276. We can manually check that the reference is wrong: it's missing the n - 1 correction term in the denominator.
We should use this updated reference.
variance_stack_driver.py
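The updated reference might look like this sketch (the original listing is lost; reference_variance is the name from the error output, reference_mean is my guess at its companion):

```python
def reference_mean(xs):
    # Batch sample mean; 0.0 for an empty stack by convention.
    return sum(xs) / len(xs) if xs else 0.0

def reference_variance(xs):
    # Two-pass batch sample variance, now with the n - 1 correction
    # the falsifying example showed was missing.
    n = len(xs)
    if n < 2:
        return 0.0
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)
```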
Let's now re-enable calls to pop().
variance_stack_driver.py
And now things fail in new and excitingly numerical ways.
Falsifying example:
state = VarianceStackDriver()
state.push(v=0.0)
state.push(v=0.00014142319560050964)
state.push(v=14188.9609375)
state.pop()
state.teardown()
F
This counterexample fails with the online variance returning 0.0 instead of 1e-8. That's not unexpected: removing (the square of) a large value from a running sum spells catastrophic cancellation. It's also not that bad for my use case, where I don't expect to observe very large values.
Another problem for our test harness is that floats are very dense around 0.0, and I'm OK with small (around 1e-8) absolute error because the inputs and outputs will be single floats.
Let's relax assert_almost_equal, and restrict generated observations to fall in \([-2\sp{12}, 2\sp{12}].\)
variance_stack_driver.py
With all these tweaks to make sure we generate easy (i.e., interesting) test cases, Hypothesis fails to find a failure after its default time budget.
I'm willing to call that a victory.
From stack to full multiset
We have tested code to undo updates in Welford's classic streaming variance algorithm. Unfortunately, inverting pushes away only works for LIFO edits, and we're looking for arbitrary inserts and removals (and updates) to a multiset of observations.
However, both the mean \(\hat{x} = \sum\sb{x\in X} x/n\) and the centered second moment \(\sum\sb{x\in X}(x - \hat{x})\sp{2}\) are order-independent: they're just sums over all observations. Disregarding round-off, we'll find the same mean and second moment regardless of the order in which the observations were pushed in. Thus, whenever we wish to remove an observation from the multiset, we can assume it was the last one added to the estimates, and pop it off.
We think we know how to implement running mean and variance for a multiset of observations. How do we test that with Hypothesis?
The hardest part about testing dictionary (map)-like interfaces is making sure to generate valid identifiers when removing values. As it turns out, Hypothesis has built-in support for this important use case, with its Bundles. We'll use that to test a dictionary from observation name to observation value, augmented to keep track of the current mean and variance of all values.
variance_multiset_driver.py
Each call to add_entry will either go to update_entry if the key already exists, or add an observation to the dictionary and the streaming estimator. If we have a new key, it is added to the keys Bundle; calls to del_entry and update_entry draw keys from this Bundle. When we remove an entry, it's also consumed from the keys Bundle.
Hypothesis finds no fault with our new implementation of dictionary-with-variance, but update seems like it could be made much faster and more numerically stable, and I intend to mostly use this data structure for calls to update.
Faster multiset updates
The key operation for my use case is to update one observation by replacing its old value with a new one. We can maintain the estimator by popping old away and pushing new in, but this business with updating the number of observations n and rescaling everything seems like a lot of numerical trouble. We should be able to do better.
We're replacing the multiset of sampled observations \(X\) with \(X\sp{\prime} = X \setminus \{\textrm{old}\} \cup \{\textrm{new}\}.\) It's easy to maintain the mean after this update: \(\hat{x}\sp{\prime} = \hat{x} + (\textrm{new} - \textrm{old})/n.\)
The update to self.var_sum, the sum of squared differences from the mean, is trickier. We start with \(v = \sum\sb{x\in X} (x - \hat{x})\sp{2},\) and we wish to find \(v\sp{\prime} = \sum\sb{x\sp{\prime}\in X\sp{\prime}} (x\sp{\prime} - \hat{x}\sp{\prime})\sp{2}.\)
Let \(\delta = \textrm{new} - \textrm{old}\) and \(\delta\sb{\hat{x}} = \delta/n.\) We have \[\sum\sb{x\in X} (x - \hat{x}\sp{\prime})\sp{2} = \sum\sb{x\in X} [(x - \hat{x}) - \delta\sb{\hat{x}}]\sp{2},\] and \[[(x - \hat{x}) - \delta\sb{\hat{x}}]\sp{2} = (x - \hat{x})\sp{2} - 2\delta\sb{\hat{x}} (x - \hat{x}) + \delta\sb{\hat{x}}\sp{2}.\]
We can reassociate the sum, and find
\[\sum\sb{x\in X} (x - \hat{x}\sp{\prime})\sp{2} = \sum\sb{x\in X} (x - \hat{x})\sp{2} - 2\delta\sb{\hat{x}} \sum\sb{x \in X} (x - \hat{x}) + n \delta\sb{\hat{x}}\sp{2}.\]
Once we notice that \(\hat{x} = \sum\sb{x\in X} x/n,\) it's clear that the middle term sums to zero, and we find the very reasonable
\[v\sb{\hat{x}\sp{\prime}} = \sum\sb{x\in X} (x - \hat{x})\sp{2} + n \delta\sb{\hat{x}}\sp{2} = v + \delta \delta\sb{\hat{x}}.\]
This new accumulator \(v\sb{\hat{x}\sp{\prime}}\) corresponds to the sum of the squared differences between the old observations \(X\) and the new mean \(\hat{x}\sp{\prime}\). We still have to update one observation from old to new. The remaining adjustment to \(v\) (self.var_sum) corresponds to going from \((\textrm{old} - \hat{x}\sp{\prime})\sp{2}\) to \((\textrm{new} - \hat{x}\sp{\prime})\sp{2},\) where \(\textrm{new} = \textrm{old} + \delta.\)
After a bit of algebra, we get \[(\textrm{new} - \hat{x}\sp{\prime})\sp{2} = [(\textrm{old} - \hat{x}\sp{\prime}) + \delta]\sp{2} = (\textrm{old} - \hat{x}\sp{\prime})\sp{2} + \delta (\textrm{old} - \hat{x}\sp{\prime} + \textrm{new} - \hat{x}\sp{\prime}).\]
The adjusted \(v\sb{\hat{x}\sp{\prime}}\) already includes \((\textrm{old} - \hat{x}\sp{\prime})\sp{2}\) in its sum, so we only have to add the last term to obtain the final updated self.var_sum:
\[v\sp{\prime} = v\sb{\hat{x}\sp{\prime}} + \delta (\textrm{old} - \hat{x}\sp{\prime} + \textrm{new} - \hat{x}\sp{\prime}) = v + \delta [2 (\textrm{old} - \hat{x}) + \delta - \delta\sb{\hat{x}}].\]
That's our final implementation for VarianceBag.update, for which Hypothesis also fails to find failures.
VarianceBag.py
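The original listing is lost; here is a sketch of the whole VarianceBag, where update implements the recurrence just derived. Only VarianceBag.update and var_sum are names confirmed by the text; the rest (push, pop, the accessors) are inferred:

```python
class VarianceBag:
    """Streaming mean/variance over a multiset, with O(1) replacement."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.var_sum = 0.0  # sum of squared differences from the mean

    def push(self, v):
        # Welford's forward update.
        self.n += 1
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def pop(self, v):
        # LIFO inverse of push; order-independence lets us treat any
        # removed observation as if it were the last one pushed.
        assert self.n > 0
        if self.n == 1:
            self.n, self.mean, self.var_sum = 0, 0.0, 0.0
            return
        old_mean = self.mean + (self.mean - v) / (self.n - 1)
        self.var_sum -= (v - old_mean) * (v - self.mean)
        self.mean = old_mean
        self.n -= 1

    def update(self, old, new):
        # Constant-time replacement, per the derivation above:
        # v' = v + delta * (2 * (old - mean) + delta - delta_mean).
        assert self.n > 0
        delta = new - old
        delta_mean = delta / self.n
        self.var_sum += delta * (2 * (old - self.mean) + delta - delta_mean)
        self.mean += delta_mean

    def get_mean(self):
        return self.mean

    def get_variance(self):
        return self.var_sum / (self.n - 1) if self.n > 1 else 0.0
```

Note that update reads self.mean before adjusting it: the recurrence is expressed in terms of the old mean.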
How much do you trust testing?
We have automated property-based tests and some human-checked proofs. Ship it?
I was initially going to ask a CAS to check my reformulations, but the implicit \(\forall\) looked messy. Instead, I decided to check the induction hypothesis implicit in VarianceBag.update, and enumerate all cases up to a certain number of values with Z3 in IPython.
In [1]: from z3 import *
In [2]: x, y, z, new_x = Reals("x y z new_x")
In [3]: mean = (x + y + z) / 3
In [4]: var_sum = sum((v - mean) * (v - mean) for v in (x, y, z))
In [5]: delta = new_x - x
In [6]: new_mean = mean + delta / 3
In [7]: delta_mean = delta / 3
In [8]: adjustment = delta * (2 * (x - mean) + (delta - delta_mean))
In [9]: new_var_sum = var_sum + adjustment
# We have our expressions. Let's check equivalence for mean, then var_sum
In [10]: s = Solver()
In [11]: s.push()
In [12]: s.add(new_mean != (new_x + y + z) / 3)
In [13]: s.check()
Out[13]: unsat # No counterexample of size 3 for the updated mean
In [14]: s.pop()
In [15]: s.push()
In [16]: s.add(new_mean == (new_x + y + z) / 3) # We know the mean matches
In [17]: s.add(new_var_sum != sum((v - new_mean) * (v - new_mean) for v in (new_x, y, z)))
In [18]: s.check()
Out[18]: unsat # No counterexample of size 3 for the updated variance
Given this script, it's a small matter of programming to generalise from 3 values (x, y, and z) to any fixed number of values, and generate all small cases up to, e.g., 10 values.
z3check.py
I find the most important thing when it comes to using automated proofs is to insert errors and confirm we can find the bugs we're looking for.
I did that by manually mutating the expressions for new_mean and new_var_sum in updated_expressions. This let me find a simple bug in the initial implementation of test_num_var: I used if not result instead of result != unsat, and both sat and unsat are truthy. The code initially failed to flag a failure when z3 found a counterexample for our correctness condition!
And now I'm satisfied
I have code to augment an arbitrary multiset or dictionary with a running estimate of the mean and variance; that code is based on a classic recurrence, with some new math checked by hand, with automated tests, and with some exhaustive checking of small inputs (to which I claim most bugs can be reduced).
I'm now pretty sure the code works, but there's another more obviously correct way to solve that update problem. This 2008 report by Philippe Pébay^{1} presents formulas to compute the mean, variance, and arbitrary moments in one pass, and shows how to combine accumulators, a useful operation in parallel computing.
We could use these formulas to augment an arbitrary \(k\)-ary tree and recombine the merged accumulators as we go back up the (search) tree from the modified leaf to the root. The update would be much more stable (we only add and merge observations), and incur logarithmic time overhead (with linear space overhead). However, given the same time budget and a logarithmic space overhead, we could also implement the constant-time update with arbitrary-precision software floats, and probably guarantee even better precision.
The constant-time update I described in this post demanded more effort to convince myself of its correctness, but I think it's always a better option than an augmented tree for serial code, especially if initial values are available to populate the accumulators with a batch-computed mean and variance. I'm pretty sure the code works, and it's up in this gist. I'll be reimplementing it in C++ because that's the language used by the project that led me to this problem; feel free to steal that gist.

There's also a 2016 journal article by Pébay and others with numerical experiments, but I failed to implement their simpler-looking scalar update...↩
01 Dec 2019 4:51am GMT
30 Nov 2019
Planet Lisp
Quicklisp news: November 2019 Quicklisp dist update now available
New projects:
 cacau  Test Runner in Common Lisp.  GPLv3
 clelastic  Elasticsearch client for Common Lisp  MIT
 cllibiio  Common Lisp bindings for libiio.  GPLv3
 cltransmission  A Common Lisp library to interface with transmission using its rpc  MIT
 dartscltools  More or less useful utilities  MIT
 plainodbc  a lisp wrapper around the ODBC library  BSD
 sanityclause  Sanity clause is a data contract and validation library.  LGPLv3
 tfm  A TeX Font Metrics library for Common Lisp  BSD
 tmpdir  Simple library to create temporary directories  MIT license
 trivialmethodcombinations  Portability library for accessing method combination objects  Unlicense
 winlock  File locking using the Windows API.  MIT
 withuserabort  provides an easy way to catch ctrl+c. useful for making binaries.  BSD 3Clause
Updated projects: april, assertp, assertionerror, asyncprocess, babel, bordeauxthreads, bp, ceramic, chameleon, chanl, cl+ssl, clasync, clbnf, clcollider, cldct, cleditdistance, clenvironments, clfond, cli18n, clkraken, clkyotocabinet, cllzma, clnaivestore, clopengl, clpatterns, clprevalence, clpslib, clrdkafka, clredis, clsdl2, clsmtp, clstore, clstr, cltiled, cltoml, cltorrents, closermop, clsqllocaltime, clss, clx, clxxembed, coleslaw, commonlispactors, conf, croatoan, currycomposereadermacros, dartsclhashtree, declt, deploy, drakma, dufy, eagerfuture2, eventbus, femlisp, fiasco, gendl, genericcl, graph, horner, hunchentootmultiacceptor, jsown, lake, letplus, lichatprotocol, linearprogramming, literatelisp, lyrics, magicl, maiden, markup, mcclim, metabangbind, mmap, nodgui, numutils, overlord, parachute, paren6, petalisp, phoetoolbox, polisher, postmodern, py4cl, qbase64, qtlibs, qtools, quilc, quri, qvm, replic, roan, rutils, scextensions, scalpl, sel, serapeum, shadow, simpleparalleltasks, simplet, skeletoncreator, sly, staple, staticvectors, studioclient, stumpwm, swapbytes, sxql, thecostofnothing, trivia, trivialbenchmark, trivialjumptables, trivialpackagelocalnicknames, trivialssh, typer, umbra, vernacular, woo, wookie, xmlemitter, youtube.
To get this update, use: (ql:update-dist "quicklisp")
30 Nov 2019 9:09pm GMT