14 Dec 2019

Planet Lisp

Zach Beane: Exclusive file open

The Planet Lisp twitter fix involved tracking post status with a file. Although it's not 100% bulletproof, there's a trick when using files to reduce races.

Here's a naive approach:

  1. Assign a Planet Lisp post a unique pathname
  2. If that pathname exists, do nothing
  3. Otherwise, send a tweet for the post and save its details to the pathname

The problem is between 2 and 3 - another process could post the tweet first.

There's another option:

  1. Assign a Planet Lisp post a unique pathname
  2. Attempt to open the file with O_EXCL|O_CREAT mode
  3. If the attempt succeeds, send a tweet for the post and save its details to the pathname

The key bit is O_EXCL|O_CREAT mode, a Unix option that means open will fail if the file exists, and succeed atomically otherwise. Specifically:

The check for the existence of the file and the creation of the file if it does not exist shall be atomic with respect to other threads executing open() naming the same filename in the same directory with O_EXCL and O_CREAT set.

In SBCL, you can get O_EXCL|O_CREAT semantics by specifying :direction :output and NIL, :error, or :new-version as the :if-exists argument to CL:OPEN. With NIL, you can check whether the returned value is NIL instead of a stream to know whether you successfully opened the file exclusively and can proceed. I used the NIL approach in the Planet Lisp twitter gateway.
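As a concrete illustration, here is a minimal sketch of the :if-exists NIL approach (mark-and-run is a hypothetical helper name for this post, not the gateway's actual code):

```lisp
;; A minimal sketch of the :if-exists NIL approach described above.
;; MARK-AND-RUN calls THUNK only if PATHNAME did not already exist;
;; the existence check and the creation are a single atomic operation
;; (O_EXCL|O_CREAT underneath, on implementations that map it so).
(defun mark-and-run (pathname thunk)
  (let ((stream (open pathname :direction :output :if-exists nil)))
    (when stream
      (unwind-protect (funcall thunk)
        (close stream)))))
```

The first caller creates the file and runs the thunk; any later caller gets NIL back from OPEN and does nothing, with no window between the existence check and the creation.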

14 Dec 2019 1:18pm GMT

Zach Beane: Planet Lisp is back on Twitter

I brought the @planet_lisp Twitter account back to life after more than a year of dormancy. It updates Twitter with every item on Planet Lisp.

It broke a while back because of how I used Twitter itself to keep track of what had been posted. I updated it to use a local data directory instead. It's working well so far.

14 Dec 2019 2:59am GMT

13 Dec 2019


Vsevolod Dyomkin: Programming Algorithms: Dynamic Programming

This chapter opens the final part of the book, entitled "Selected Algorithms". In it, we're going to apply the knowledge from the previous chapters to analyzing a selection of important problems that are mostly application-independent and find use in many applied domains: optimization, synchronization, compression, and the like.

We will start with a single approach that is arguably the most powerful algorithmic technique in use. If we manage to reduce a problem to Dynamic Programming (DP), in most cases we can consider it solved. The fact that we have progressed so far in this book without mentioning DP is quite amazing. Actually, we could have already talked about it several times, especially in the previous chapter on strings, but I deliberately held this topic back for its own chapter. Indeed, strings are one of the domains where dynamic programming is used quite heavily, but the technique finds application in almost every area.

Also, DP is one of the first marketing terms in CS. When Bellman invented it, he wanted to use the then-hyped term "programming" to promote his idea. This has probably caused more confusion over the years than benefit. In fact, a good although unsexy name for this technique could simply be "filling the table", as the essence of the approach is an exhaustive evaluation of all variants with memoization of partial results (in a table) to avoid redundant recomputation. Obviously, it brings a benefit only when there are redundant computations, which is not the case, for example, with combinatorial optimization. To determine if a problem may be solved with DP, we need to validate that it has the optimal substructure property:

A problem has optimal substructure if an optimal solution to the whole problem incorporates optimal solutions to its subproblems.

An example of optimal substructure is the shortest path problem. If the shortest path from point A to point B passes through some point C, and there are multiple paths from C to B, the one included in the shortest path A-B must be the shortest of them. In fact, the shortest path is an archetypical DP problem, which we'll discuss later in this chapter. A counterexample is the Travelling Salesman Problem (TSP): if it had optimal substructure, the subpath between any two nodes in the resulting tour would have to be the shortest possible path between those nodes. But this doesn't hold for all nodes, because the requirement that the edges form a cycle visiting every node may force the tour to take non-shortest routes between some of them.
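Since the essence of DP is memoizing partial results in a table, it may help to see that machinery in isolation. Below is a minimal sketch in plain Common Lisp (the chapter's own code uses RUTILS shorthands; memo is a name introduced here for illustration):

```lisp
;; A generic "filling the table" wrapper: MEMO wraps a one-argument
;; function with an EQUAL hash-table that acts as the table of
;; partial results, so each distinct argument is computed only once.
(defun memo (fn)
  (let ((table (make-hash-table :test 'equal)))
    (lambda (arg)
      (multiple-value-bind (val found) (gethash arg table)
        (if found
            val
            (setf (gethash arg table) (funcall fn arg)))))))
```

Such a wrapper only pays off when the same subproblems recur, which is exactly the situation the optimal substructure property describes.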

Fibonacci Numbers

So, as we said, the essence of DP is filling a table. This table, though, may have a different number of dimensions for different problems. Let's start with a 1d case. What book on algorithms can omit discussing the Fibonacci numbers? Usually, they are used to illustrate recursion, yet they are also a great showcase for the power of memoization. Besides, recursion is, conceptually, also an integral part of DP.

A naive approach to calculating the i-th number is to directly code the Fibonacci formula:

(defun naive-fib (i)
  (assert (typep i '(integer 0)))
  (if (< i 2) 1
      (+ (naive-fib (- i 1))
         (naive-fib (- i 2)))))

However, applying it will result in an exponential growth of the number of computations: each call to naive-fib results in two more calls. So, the number of calls needed for the n-th number, with this approach, is O(2^n).

> (time (naive-fib 40))
Evaluation took: 3.390 seconds of real time
165580141
> (time (naive-fib 42))
Evaluation took: 7.827 seconds of real time
433494437

Yet, we can see here a direct manifestation of the optimal substructure property: the i-th number calculation uses the result of the (1- i)-th one. To utilize this recurrence, we'll need to store the previous results and reuse them. This may be achieved by replacing the function call with a table access. Actually, from the point of view of math, tables and functions are, basically, the same thing.

(let ((fib (vec 1 1)))  ; our table will be an adjustable vector
  (defun fib (i)
    (when (< (length fib) i)
      (vector-push-extend (fib (- i 1)) fib))
    (+ (? fib (- i 1))
       (? fib (- i 2)))))

What we've done here is add a layer of memoization to our function, using an array fib that is filled with the consecutive Fibonacci numbers. The array is hidden inside the closure of the fib procedure, so it persists between calls and accumulates the numbers as they are requested. There is also no way to clear it, apart from redefining the function, as closed-over variables of this kind are not accessible outside of the function. The consecutive property is ensured by the arrangement of the recursive calls: the table is filled on the recursive ascent, starting from the lowest yet-unknown number. This approach guarantees that each Fibonacci number is calculated exactly once and reduces our dreaded O(2^n) running time to a mere O(n)!

Such a calculation is the simplest example of top-down DP, which is performed using recursion. Despite its natural elegance, it suffers from a minor problem that may turn significant in some cases: extra space consumption by each recursive call. It's not only O(n) in time, but also in space. The alternative strategy that gets rid of the redundant space usage is called bottom-up DP and is based on loops instead of recursion. Switching to it is quite trivial in this case:

(let ((fib (vec 1 1)))
  (defun bottom-up-fib (i)
    (let ((off (length fib)))
      (adjust-array fib (1+ i) :fill-pointer t)
      (dotimes (j (- (1+ i) off))
        (let ((j (+ j off)))
          (:= (aref fib j)
              (+ (aref fib (- j 1))
                 (aref fib (- j 2)))))))
    (aref fib i)))
> (time (bottom-up-fib 42))
Evaluation took: 0.000 seconds of real time
> (time (bottom-up-fib 4200))
Evaluation took: 0.004 seconds of real time
40512746637826407679504078155833145442086707013857032517543... ; this number is a Lisp bignum, which has unlimited size

Funny enough, a real-world-ready implementation of Fibonacci numbers ends up not using recursion at all...
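For illustration, such a loop-based, constant-space variant might look as follows (iter-fib is a name introduced here; it keeps the same indexing as the versions above):

```lisp
;; Constant-space iterative Fibonacci: only the last two numbers
;; are kept at any moment, so it runs in O(n) time and O(1) space
;; (modulo the size of the bignums themselves).
(defun iter-fib (i)
  (let ((prev 1)
        (cur 1))
    (dotimes (k (1- i) cur)
      ;; advance the pair: prev <- cur, cur <- prev + cur
      (shiftf prev cur (+ prev cur)))))
```

It agrees with naive-fib on every index but needs no table at all, since each step of the recurrence only looks two elements back.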

String Segmentation

Let's consider another 1d problem: suppose we have a dictionary of words and a string consisting of those words that somehow lost the spaces between them - the words got glued together. We need to restore the original string with spaces or, to phrase it differently, split the string into words. This is one of the instances of string segmentation problems, and if you're wondering how and where such a situation could occur for real, consider Chinese text that doesn't have to contain spaces. Every Chinese language processing system needs to solve a similar task.

Here's an example input[1]:

String: thisisatest
Dictionary: a, i, s, at, is, hi, ate, his, sat, test, this
Expected output: this is a test

It is clear that even with such a small dictionary there are multiple ways we could segment the string. The straightforward and naive approach is to use a greedy algorithm. For instance, a shortest-first solution will try to find the shortest word from the dictionary starting at the current position and then split it off (as a prefix) from the string. It will result in the following split: this i sat est. But the last part est isn't in the dictionary, so the algorithm has failed to produce any of the possible correct splits (although, by chance, if the initial conditions were different, it could have succeeded). Another version - the longest-first approach - could look for the longest words instead of the shortest. This would result in: this is ate st. Once again, the final token is not a word. It is pretty obvious that these simple takes are not correct, and we need a more nuanced solution.

As a common next step in developing such brute-force approaches, a developer would resort to backtracking: when the computation reaches a position in the string from which no word in the dictionary may be recovered, it unwinds to the position of the previous successful split and tries a different word. This procedure may have to retreat multiple steps back - possibly to the very beginning. As a result, in the worst case, to find a correct split, we may need to exhaustively try all possible combinations of words that fit into the string.

Here's an illustration of the recursive shortest-first greedy algorithm operation:

(defun shortest-first-restore-spaces (dict str)
  (dotimes (i (length str))
    (let ((word (slice str 0 (1+ i))))
      (when (? dict word)
        (return-from shortest-first-restore-spaces
          (cond-it
            ((= (1+ i) (length str))
             word)
            ((shortest-first-restore-spaces dict (slice str (1+ i)))
             (format nil "~A ~A" word it))))))))

CL-USER> (defparameter *dict* (hash-set 'equal "a" "i" "at" "is" "hi" "ate" "his" "sat" "test" "this"))
CL-USER> (shortest-first-restore-spaces *dict* "thisisatest")
0: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "thisisatest")
1: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "isatest")
2: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "satest")
3: (SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "est")
3: SHORTEST-FIRST-RESTORE-SPACES returned NIL
2: SHORTEST-FIRST-RESTORE-SPACES returned NIL
1: SHORTEST-FIRST-RESTORE-SPACES returned NIL
0: SHORTEST-FIRST-RESTORE-SPACES returned NIL
NIL

To add backtracking into the picture, we need to avoid returning when the recursive call fails:

(defun bt-shortest-first-restore-spaces (dict str)
  (dotimes (i (length str))
    (let ((word (slice str 0 (1+ i))))
      (when (in# word dict)
        (when (= (1+ i) (length str))
          (return-from bt-shortest-first-restore-spaces word))
        (when-it (bt-shortest-first-restore-spaces dict (slice str (1+ i)))
          (return-from bt-shortest-first-restore-spaces
            (format nil "~A ~A" word it)))))))

CL-USER> (bt-shortest-first-restore-spaces *dict* "thisisatest")
0: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "thisisatest")
1: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "isatest")
2: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "satest")
3: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "est")
3: BT-SHORTEST-FIRST-RESTORE-SPACES returned NIL
2: BT-SHORTEST-FIRST-RESTORE-SPACES returned NIL
;; backtracking kicks in here
2: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "atest")
3: (BT-SHORTEST-FIRST-RESTORE-SPACES #<HASH-TABLE :TEST EQUAL :COUNT 10 {101B093953}> "test")
3: BT-SHORTEST-FIRST-RESTORE-SPACES returned "test"
2: BT-SHORTEST-FIRST-RESTORE-SPACES returned "a test"
1: BT-SHORTEST-FIRST-RESTORE-SPACES returned "is a test"
0: BT-SHORTEST-FIRST-RESTORE-SPACES returned "this is a test"
"this is a test"

Lisp trace is an invaluable tool for understanding the behavior of recursive functions. Unfortunately, it doesn't work for loops, for which one has to resort to debug printing.

Realizing that this is brute force, we could just as well use another approach: generate all possible sequences of dictionary words up to the total number of characters (n) and choose the ones that match the current string. The exact complexity of this scheme is O(2^n)[2]. In other words, our solution leads to a combinatorial explosion in the number of possible variants - a clear no-go for every algorithmic developer.

So, we need to come up with something different, and, as you might have guessed, DP fits in perfectly as the problem has the optimal substructure: a complete word in the substring of the string remains a complete word in the whole string as well. Based on this understanding, let's reframe the task in a way that lends itself to DP better: find each character in the string that ends a complete word so that all the words combined cover the whole string and do not intersect[3].

Here is an implementation of the DP-based procedure. Apart from calculating the maximum length of a word in the dictionary, which usually may be done offline, it requires a single forward pass and a single backward pass. The forward pass is a linear scan of the string that, at each character, tries to find all the words starting at it that match the string. The complexity of this pass is O(n * w), where w is the constant length of the longest word in the dictionary, i.e. it is, actually, O(n). The backward pass (called, in the context of DP, decoding) restores the spaces using the so-called backpointers stored in the dp array. Below is a simplistic implementation that returns a single match. A recursive variant is possible, with or without a backward pass, that will accumulate all the possible variants.

(defun dp-restore-spaces (dict str)
  (let ((dp (make-array (1+ (length str)) :initial-element nil))
        ;; in the production implementation, the following calculation
        ;; should be performed at the pre-processing stage
        (w (reduce 'max (mapcar 'length (keys dict))))
        (begs (list))
        (rez (list)))
    ;; the outer loop tries to find the next word
    ;; only starting from the ends of the words that were found previously
    (do ((i 0 (pop begs)))
        ((or (null i)
             (= i (length str))))
      ;; the inner loop checks all substrings of length 1..w
      (do ((j (1+ i) (1+ j)))
          ((>= j (1+ (min (length str)
                          (+ w i)))))
        (when (? dict (slice str i j))
          (:= (? dp j) i)
          (push j begs)))
      (:= begs (reverse begs)))
    ;; the backward pass
    (do ((i (length str) (? dp i)))
        ((null (? dp i)))
      (push (slice str (? dp i) i) rez))
    (strjoin #\Space rez)))

CL-USER> (dp-restore-spaces *dict* "thisisatest")
"this is a test"

Similarly to the Fibonacci numbers, the solution to this problem doesn't use any additional information to choose between several variants of a split; it just takes the first one. However, if we wanted to find the variant that is most plausible to a human reader, we'd need to add some measure of plausibility. One idea might be to use a frequency dictionary, i.e. prefer the words that have a higher frequency of occurrence in the language. Such an approach, unfortunately, also has drawbacks: it overemphasizes short and frequent words, such as determiners, and doesn't account for how words are combined in context.

A more advanced option would be to use a frequency dictionary not just of single words but of whole phrases (ngrams). The longer the phrases, the better from the standpoint of linguistics, but also the worse from the engineering point of view: more storage space is needed, and more data has to be processed to collect reliable statistics for all the possible variants. And, once again, with the growth of the number of words in an ngram, we will be facing the issue of combinatorial explosion pretty soon. The optimal point for this particular task might be bigrams or trigrams, i.e. phrases of 2 or 3 words. Using them, we'd have to supply another dictionary to our procedure and track the measure of plausibility of the current split as a product of the frequencies of the selected ngrams. Formulated this way, our exercise becomes not merely an algorithmic task but an optimization problem. And DP is also suited to solving such problems - in fact, that was the primary purpose it was intended for in the Operations Research community. We'll see it in action with our next problem - text justification. And developing a restore-spaces-plausibly procedure is left as an exercise to the reader. :)
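To make the notion of a plausibility measure a bit more concrete, here is a toy unigram scorer - not the restore-spaces-plausibly solution itself; split-score and its smoothing parameter are illustrative inventions:

```lisp
;; A toy unigram plausibility scorer: the score of a split is the
;; product of the relative frequencies of its words. FREQS is a
;; hash-table mapping word -> relative frequency; unseen words get
;; a small smoothing constant so a single gap doesn't zero the score.
(defun split-score (freqs words &key (smoothing 1/1000000))
  (reduce '* words
          :key (lambda (word)
                 (or (gethash word freqs) smoothing))))
```

Multiplying relative frequencies (with smoothing for unseen words) is the simplest instantiation of the idea; the bigram or trigram variants discussed above would multiply the frequencies of word pairs or triples instead.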

Text Justification

The task of text justification is relevant to both editing and reading software: given a text consisting of paragraphs, split each paragraph into lines that contain whole words only, within a given line length limit, so that the variance of line lengths is the smallest. Its solution may be used, for example, to display text in HTML blocks with an align=justify property.

A more formal task description would be the following:

As we are discussing this problem in the context of DP, we first need to determine what its optimal substructure is. Superficially, we could claim that the optimal solution should consist only of the lines that have the smallest penalty, according to the formula. However, this doesn't work, as some of the potential lines with the best plausibility (length closest to the limit) may overlap, i.e. the optimal split may not be able to include all of them. What we can reliably claim is that, if the text is already justified from position 0 to i, we can still justify the remainder optimally regardless of how the prefix is split into lines. This is, basically, the same as with string segmentation, where we didn't care how the string was segmented before position i.

And it's a common theme in DP problems: the key feature that allows us to save on redundant computation is that we only remember the optimal result of the computation that led to a particular partial solution, but we don't care what particular path was taken to obtain it (except when we need to restore the path, but that's what the backpointers are for - they don't impact the forward pass of the algorithm). So the optimal substructure property of text justification is that if the best split of the whole string includes the consecutive indices x and y, then the best split from 0 to y should include x.

Let's justify the following text with a line limit of 50 chars:

Common Lisp is the modern, multi-paradigm, high-performance, compiled, ANSI-standardized,
most prominent descendant of the long-running family of Lisp programming languages.

Suppose we've already justified the first 104 characters. This leaves us with a suffix that has a length of 69: descendant of the long-running family of Lisp programming languages. As its length is above 50 chars but below 100, we can conclude that it requires exactly one split. This split may be performed after the first, second, third, etc. token. Let's calculate the total penalty of each candidate:

after "the": 5832 + 0 = 5832
after "long-running": 6859 + 2197 = 9056
after "family": 1728 + 8000 = 9728
after "of": 729 + 12167 = 12896
after "Lisp": 64 + 21952 = 22016

So, the optimal split starting at index 105[4] is into the strings: "descendant of the" and "long-running family of Lisp programming languages." Now, we haven't guaranteed that index 105 will, in fact, be a split point in the optimal split of the whole string, but, if it were, we would already know how to continue. This is the key idea of the DP-based justification algorithm: starting from the end, calculate the cost of justifying the remaining suffix after each token, using the results of previous calculations. At first, while the suffix length is below the line limit, the costs are trivially computed by a single call to the penalty function. After exceeding the line limit, the calculation consists of two parts: the penalty of the current line plus the previously calculated value for the rest of the suffix.
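The penalty function used by the code below is not shown in this excerpt. Judging by the cubes in the numbers above (5832 = 18^3, 2197 = 13^3, and so on), a plausible reconstruction is the cube of the unused space on a line - an assumption, not necessarily the author's exact formula:

```lisp
;; A plausible penalty: the cube of the leftover space on a line.
;; Cubing (rather than squaring) punishes one very short line much
;; more than several slightly short ones, favoring even line lengths.
(defun penalty (len limit)
  (expt (- limit len) 3))
```

With this definition, a line that exactly fills the limit costs nothing, and the cost grows steeply as the line gets shorter.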

(defun justify (limit str)
  (with ((toks (reverse (split #\Space str)))
         (n (length toks))
         (penalties (make-array n))
         (backptrs (make-array n))
         (lengths (make-array n)))
    ;; forward pass (from the end of the string)
    (doindex (i tok toks)
      (let ((len (+ (length tok) (if (> i 0)
                                     (? lengths (1- i))
                                     0))))
        (:= (? lengths i) (1+ len))
        (if (<= len limit)
            (:= (? penalties i) (penalty len limit)
                (? backptrs i) -1)
            ;; minimization loop
            (let ((min most-positive-fixnum)
                  arg)
              (dotimes (j i)
                (with ((j (- i j 1))
                       (len (- (? lengths i)
                               (? lengths j)))
                       (penalty (+ (penalty len limit)
                                   (? penalties j))))
                  (when (> len limit) (return))
                  (when (< penalty min)
                    (:= min penalty
                        arg j))))
              (:= (? penalties i) min
                  (? backptrs i) arg)))))
    ;; backward pass (decoding)
    (loop :for end := (1- n) :then beg
          :for beg := (? backptrs end)
          :do (format t "~A~%" (strjoin #\Space (reverse (subseq toks (1+ beg) (1+ end)))))
          :until (= -1 beg))))

CL-USER> (justify 50 "Common Lisp is the modern, multi-paradigm, high-performance, compiled, ANSI-standardized,
most prominent descendant of the long-running family of Lisp programming languages.")

Common Lisp is the modern, multi-paradigm,
high-performance, compiled, ANSI-standardized,
most prominent descendant of the long-running
family of Lisp programming languages.

This function is somewhat longer, but, conceptually, it is pretty simple. The only insight I needed to implement it efficiently was the additional array for storing the lengths of all the string suffixes examined so far. This way, we apply memoization twice: to prevent recalculation of both the penalties and the suffix lengths, all of which are reused at each iteration. If we were to store the suffixes themselves, we would have to perform an additional O(n) length calculation at each iteration.

The algorithm performs two passes. In the forward pass (which is, in fact, performed from the end), it fills the slots of the DP arrays using the minimum joint penalty for the potential current line and the remaining suffix, the penalty for which was calculated during one of the previous iterations of the algorithm. In the backward pass, the resulting lines are extracted by traversing the backpointers starting from the last index.

The key difference from the previous DP example is these lines:

(:= (? penalties i) min
    (? backptrs i) arg)

Adding them (alongside the whole minimization loop) turns DP into an optimization framework that, in this case, is used to minimize the penalty. The backptrs array, as we said, is used to restore the steps that led to the optimal solution. Eventually (and this is true for the majority of DP optimization problems), we care about this sequence and not the optimization result itself.

As we can see, for optimization problems, the optimal substructure property manifests as a mathematical formula called the recurrence relation. It is the basis for selecting a particular substructure among the several variants that may be available at the current step of the algorithm. The relation combines an already memoized partial solution with the cost of the next part we consider adding to it. For text justification, the formula is the sum of the penalty of the current line and the previously calculated penalty of the remaining suffix. Each DP optimization task is based on a recurrence relation of a similar kind.

Now, let's look at this problem from a different perspective. We can represent our decision space as a directed acyclic graph. Its leftmost node (the "source") will be index 0, and it will have several direct descendants: nodes with those indices in the string at which we can potentially split it without exceeding the 50-character line limit - or, alternatively, each substring that spans from index 0 to the end of some token and is not longer than 50 characters. Next, we'll connect each descendant node in a similar manner with all nodes that are "reachable" from it, i.e. those that have a higher associated string position, with the difference between their index and this node's below 50. The final node of the graph (the "sink") will have the value of the length of the string. The cost of each edge is the value of the penalty function. Now, the task is to find the shortest path from source to sink.

Here is the DAG for the example string, with the nodes labeled with the indices of the potential string splits. As you can see, even for such a simple string it's already quite big - let alone real texts. But it can provide some sense of the number of variants that an algorithm has to evaluate.

What is the complexity of this algorithm? On the surface, it may seem to be O(m^2) where m is the token count, as there are two loops: over all tokens and over the tail. However, the line (when (> len limit) (return)) limits the inner loop to only the part of the string that can fit into limit chars, effectively, reducing it to a constant number of operations (not more than limit, but, in practice, an order of magnitude less). Thus, the actual complexity is O(m)[5].

Pathfinding Revisited

In fact, any DP problem may be reduced to pathfinding in a graph: the shortest path, if optimization is involved, or just any path otherwise. The nodes in this graph are the intermediate states (for instance, a split at index x or the i-th Fibonacci number), and the edges are the possible transitions, which may bear an associated cost (as in text justification) or not (as in string segmentation). And the classic DP algorithm to solve the shortest path problem is called the Bellman-Ford algorithm. Not incidentally, one of its authors, Bellman, is the "official" inventor of DP.

(defun bf-shortest-path (g)
  (with ((n (array-dimension g 0))
         (edges (edges-table g))  ; edges indexed by their destination vertex
         (dists (make-array n :initial-element most-positive-fixnum))
         (backptrs (make-array n))
         (path (list)))
    ;; dists holds the distance from each vertex to the sink (1- n);
    ;; for a DAG, a single pass over the vertices in an order where a
    ;; vertex's distance is final before its incoming edges are relaxed
    ;; is sufficient
    (:= (? dists (1- n)) 0)
    (dolist (v (vertices g))
      (dolist (e (? edges v))
        (with ((u (src e))
               (dist (+ (dist e)
                        (? dists v))))
          (when (< dist (? dists u))
            (:= (? dists u) dist
                (? backptrs u) v)))))
    ;; restore the path from the source (vertex 0) to the sink
    (loop :for v := 0 :then (? backptrs v)
          :do (push v path)
          :until (= v (1- n)))
    (values (reverse path)
            (? dists 0))))

The code for the algorithm is very straightforward, provided that our graph representation already has the vertices and edges as data structures in a convenient format or implements such operations (in the worst case, the overall complexity should be no greater than O(V+E)). For the edges, we need a kv indexed by the edge destination - the opposite of the usual representation that groups them by their sources[6].

Compared to text justification, this function looks simpler, as we don't have to perform task-specific processing that accounts for the character limit and the spaces between words. However, if we were to use bf-shortest-path, we'd have to first create the graph data structure from the original text, so all that complexity would go into the graph-creation routine. Still, from an architectural point of view, such a split may be beneficial, as the pathfinding procedure could be reused for other problems.

One might ask a reasonable question: how does Bellman-Ford fare against Dijkstra's algorithm (DA)? As we have already learned, Dijkstra's is a greedy and optimal solution to pathfinding, so why consider yet another approach? Both algorithms operate by relaxation, in which approximations to the correct distance are replaced by better ones until the final result is reached. And in both of them, the approximate distance to each vertex is always an overestimate of the true distance, and it is replaced by the minimum of its old value and the length of a newly found path. It turns out that DA is also a DP-based approach - but with additional optimizations! It uses the same optimal substructure property and recurrence relations. The advantage of DA is its use of a priority queue to efficiently select the closest vertex that has not yet been processed, after which it relaxes only that vertex's outgoing edges, while the Bellman-Ford algorithm relaxes all the edges. This lets BF calculate the shortest paths not to a single node but to all of them (which is also possible for DA, but that would make its runtime, basically, the same as BF's). So, Bellman-Ford complexity is O(V E) compared to O(E + V logV) for the optimal implementation of DA. Besides, BF can account for negative edge weights, which would break DA.

So, DA remains the algorithm of choice for the standard shortest path problem, and it's worth keeping in mind that it can also be applied as a solver for some DP problems if they are decomposed into graph construction + pathfinding. However, some DP problems have additional constraints that make using DA for them pointless. For example, in text justification, the number of edges to consider at each step is limited by a constant factor, so the complexity of the exhaustive search is, in fact, O(V). Proving that for our implementation of justify is left as an exercise to the reader...

LCS & Diff

Let's return to strings and the application of DP to them. The ultimate DP-related string problem is string alignment. It manifests in many formulations. The basic one is the Longest Common Subsequence (LCS) task: determine the length of the common part between two input strings. Solving it, however, provides enough data to go beyond that: it enables determining the best alignment of the strings, as well as enumerating the edit operations needed to transform one string into another. The edit operations usually considered in the context of LCS are:

  - insertion of a character
  - deletion of a character
  - substitution of one character for another

Based on the number of these operations, we can calculate a measure of commonality between two strings called the Levenshtein distance. It is one example of the so-called edit distances. Identical strings have a Levenshtein distance of 0, while the strings foobar and baz have a distance of 4 (3 deletion operations for the prefix foo and a substitution operation of r into z). There are also other variants of edit distances. For instance, the Damerau-Levenshtein distance, which is better suited to comparing texts with misspellings produced by humans, adds another modification operation: swap, which reduces the edit distance to 1, instead of 2 for plain Levenshtein (1 deletion and 1 insertion), in the case of two adjacent characters being transposed.
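The Damerau-Levenshtein distance is not implemented later in the chapter, so here is a sketch of its restricted ("optimal string alignment") variant in plain CL, independent of the RUTILS shorthands used elsewhere; dl-dist is a name introduced here:

```lisp
;; Restricted Damerau-Levenshtein (optimal string alignment) distance:
;; the usual Levenshtein table plus one extra case that treats a swap
;; of two adjacent characters as a single operation.
(defun dl-dist (s1 s2)
  (let* ((m (length s1))
         (n (length s2))
         (d (make-array (list (1+ m) (1+ n)))))
    ;; transforming to/from the empty string costs its length
    (dotimes (i (1+ m)) (setf (aref d i 0) i))
    (dotimes (j (1+ n)) (setf (aref d 0 j) j))
    (loop :for i :from 1 :to m :do
      (loop :for j :from 1 :to n :do
        (let ((cost (if (char= (char s1 (1- i)) (char s2 (1- j))) 0 1)))
          (setf (aref d i j)
                (min (1+ (aref d (1- i) j))            ; deletion
                     (1+ (aref d i (1- j)))            ; insertion
                     (+ cost (aref d (1- i) (1- j))))) ; substitution
          ;; the swap case: two adjacent characters transposed
          (when (and (> i 1) (> j 1)
                     (char= (char s1 (1- i)) (char s2 (- j 2)))
                     (char= (char s1 (- i 2)) (char s2 (1- j))))
            (setf (aref d i j)
                  (min (aref d i j)
                       (1+ (aref d (- i 2) (- j 2)))))))))
    (aref d m n)))
```

For example, (dl-dist "abcd" "abdc") yields 1, where plain Levenshtein gives 2.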

The Levenshtein distance, basically, gives us the DP recurrence relations for free: when we consider the i-th character of the first string and the j-th character of the second, the edit distance between the prefixes 0..i and 0..j is either the same as for the pair of prefixes ending at (1- i) and (1- j) respectively, if the current characters are equal, or 1 + the minimum of the edit distances for the pairs (i, (1- j)), ((1- i), (1- j)), and ((1- i), j).

We can encode this calculation as a function that uses a matrix for memoization. Basically, this also gives the DP solution to the LCS problem: the bottom-right element of the matrix holds the measure of the difference between the strings, and subtracting it from the length of the string tells you how much they have in common.

(defun lev-dist (s1 s2 &optional
                 (i1 (1- (length s1)))
                 (i2 (1- (length s2)))
                 (ld (make-array (list (1+ (length s1))
                                       (1+ (length s2)))
                                 :initial-element nil)
                     ldp))  ; a flag indicating that the argument was supplied
  ;; initialization of the 0-th column and row
  (unless ldp
    (dotimes (k (1+ (length s1))) (:= (aref ld k 0) 0))
    (dotimes (k (1+ (length s2))) (:= (aref ld 0 k) 0)))
  (values (or (aref ld (1+ i1) (1+ i2))
              (:= (aref ld (1+ i1) (1+ i2))
                  (if (eql (? s1 i1) (? s2 i2))
                      (lev-dist s1 s2 (1- i1) (1- i2) ld)
                      (1+ (min (lev-dist s1 s2 (1- i1) (1- i2) ld)
                               (lev-dist s1 s2 i1 (1- i2) ld)
                               (lev-dist s1 s2 (1- i1) i2 ld))))))
          ld))

However, if we want to also use this information to align the sequences, we'll have to make a reverse pass (here, a separate backpointers array isn't necessary as we can infer the direction of each step by reversing the distance formula).

(defun align (s1 s2)
  (with ((i1 (length s1))
         (i2 (length s2))
         ;; our Levenshtein distance procedure returns the whole DP matrix
         ;; as a second value
         (ld (nth-value 1 (lev-dist s1 s2)))
         (rez (list)))
    (loop
      (let ((min (min (aref ld (1- i1) (1- i2))
                      (aref ld i1 (1- i2))
                      (aref ld (1- i1) i2))))
        (cond ((= min (aref ld (1- i1) (1- i2)))
               (push (pair (? s1 (1- i1)) (? s2 (1- i2)))
                     rez)
               (:- i1)
               (:- i2))
              ((= min (aref ld (1- i1) i2))
               (push (pair (? s1 (1- i1)) nil)
                     rez)
               (:- i1))
              ((= min (aref ld i1 (1- i2)))
               (push (pair nil (? s2 (1- i2)))
                     rez)
               (:- i2))))
      (when (= 0 i1)
        (loop :for j :from (1- i2) :downto 0 :do
          (push (pair nil (? s2 j)) rez))
        (return))
      (when (= 0 i2)
        (loop :for j :from (1- i1) :downto 0 :do
          (push (pair (? s1 j) nil) rez))
        (return)))
    ;; pretty output formatting
    (with-output-to-string (out1)
      (with-output-to-string (out2)
        (with-output-to-string (out3)
          (loop :for (c1 c2) :in rez :do
            (format out1 "~C " (or c1 #\.))
            (format out2 "~C " (cond ((null c1) #\↓)
                                     ((null c2) #\↑)
                                     ((char= c1 c2) #\|)
                                     (t #\x)))
            (format out3 "~C " (or c2 #\.)))
          (format t "~A~%~A~%~A~%"
                  (get-output-stream-string out1)
                  (get-output-stream-string out2)
                  (get-output-stream-string out3)))))
    rez))

CL-USER> (align "democracy" "remorse")
d e m o c r a c y
x | | | ↑ | ↑ x x
r e m o . r . s e

CL-USER> (lev-dist "democracy" "remorse")
5
#2A((0 1 2 3 4 5 6 7)
(1 1 2 3 4 5 6 7)
(2 2 1 2 3 4 5 6)
(3 3 2 1 2 3 4 5)
(4 4 3 2 1 2 3 4)
(5 5 4 3 2 2 3 4)
(6 5 5 4 3 2 3 4)
(7 6 6 5 4 3 3 4)
(8 7 7 6 5 4 4 4)
(9 8 8 7 6 5 5 5))

It should be pretty clear how we can also extract the edit operations during the backward pass: depending on the direction of the movement (horizontal, vertical, or diagonal), it's an insertion, a deletion, or a substitution. The same operations may also be grouped to reduce noise. The alignment task is an example of a 2d DP problem; hence, the diff computation has a complexity of O(n^2). There are other notable algorithms, such as CYK parsing or the Viterbi algorithm, that also use a 2d array, although they may have higher complexity than just O(n^2). For instance, CYK parsing is O(n^3), which is very slow compared to the greedy O(n) shift-reduce algorithm.

However, the diff we will obtain from the basic LCS computation will still be pretty rudimentary. There are many small improvements that are made by production diff implementations, both on the UX and performance sides. Besides, the complexity of the algorithm is O(n^2), which is quite high, so many practical variants perform additional optimizations to reduce the actual number of operations, at least for the common cases.

The simplest improvement is a preprocessing step that is warranted by the fact that, in many applications, the diff is performed on texts that are usually mostly identical and have a small number of differences between them localized in an even smaller number of places. For instance, consider source code management, where diff plays an essential role: the programmers don't tend to rewrite whole files too often, on the contrary, such practice is discouraged due to programmer collaboration considerations.

So, some heuristics may be used in library diff implementations to speed up such common cases, for instance, skipping over the identical prefixes and suffixes of the two texts before running the full algorithm.

A perfect diff algorithm will report the minimum number of edits required to convert one text into the other. However, sometimes the result is too perfect and not very good for human consumption. People expect edit operations to be separated at token boundaries when possible; also, larger contiguous parts are preferred to an alternation of many small changes. All these and other diff ergonomics issues may be addressed by various postprocessing tweaks.

But, besides these simple tricks, are global optimizations to the algorithm possible? After all, O(n^2) space and time requirements are still pretty significant. Originally, diff was developed for Unix by Hunt and McIlroy. Their approach computes matches in the whole file and indexes them into the so-called k-candidates, k being the LCS length. The LCS is augmented progressively by finding matches that fall within proper ordinates (following a rule explained in their paper). While doing this, each path is memoized. The problem with this approach is that it performs more computation than necessary: it memoizes all the paths, which requires O(n^2) memory in the worst case and O(n^2 log n) time!

The current standard approach is the divide-and-conquer Myers algorithm. It works by recursively finding the central match of the two sequences along the smallest edit script. Once this is done, only that match is memoized, and the two pairs of subsequences preceding and following it are compared again recursively by the same procedure, until there is nothing more to compare. Finding the central match is done by matching the ends of the subsequences as far as possible, and any time it is not possible, augmenting the edit script by 1 operation, scanning each furthest position attained up to there for each diagonal, and checking how far the match can expand. If two matches merge, the algorithm has just found the central match. This approach has the advantage of using only O(n) memory and executes in O(n d), where d is the edit script complexity (d is less than n, usually much less). The Myers algorithm wins because it does not memoize the paths while working and does not need to "foresee" where to go, so it can concentrate only on the furthest positions it can reach with an edit script of the smallest complexity. The smallest-complexity constraint ensures that what is found is the LCS. In a sense, the Myers algorithm is to the vanilla DP diff what Dijkstra's algorithm is to Bellman-Ford: it cuts down on the calculation of the edit distances between the substrings that don't contribute to the optimal alignment, while solving LCS by building the whole edit-distance matrix performs the computation for all substrings.
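The furthest-reaching-path idea at the core of the Myers algorithm can be sketched in a few lines of Python (an illustrative sketch, not the full divide-and-conquer version: it computes only d, the length of the shortest edit script of insertions and deletions, without recovering the central match):

```python
def myers_distance(a, b):
    """Length of the shortest edit script (insertions + deletions only)."""
    n, m = len(a), len(b)
    max_d = n + m
    offset = max_d + 1
    # v[offset + k] = furthest x reached so far on diagonal k = x - y
    v = [0] * (2 * max_d + 3)
    for d in range(max_d + 1):          # d = current edit-script length
        for k in range(-d, d + 1, 2):
            # pick the predecessor diagonal that reached furthest
            if k == -d or (k != d and v[offset + k - 1] < v[offset + k + 1]):
                x = v[offset + k + 1]       # step down: insert b[y]
            else:
                x = v[offset + k - 1] + 1   # step right: delete a[x]
            y = x - k
            # follow the "snake": free diagonal moves over matching characters
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[offset + k] = x
            if x >= n and y >= m:
                return d
    return max_d
```

Since this distance counts only insertions and deletions, it equals n + m - 2*LCS; for "democracy" and "remorse" (LCS "emor", length 4) that's 8, while the Levenshtein distance computed earlier is 5.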

The diff tool is a prominent example of a transition from quite an abstract algorithm to a practical utility that is an essential part of many ubiquitous software products, and of the additional work needed to ensure that the final result is not only theoretically sane but also usable.

P.S. Ever wondered how github and other tools, when displaying the diff, not only show the changed line but also highlight the exact changes in the line? The answer is given in [7].

DP in Action: Backprop

As we said in the beginning, DP has applications in many areas: from Machine Learning to graphics to Source Code Management. Literally, you can find an algorithm that uses DP in every specialized domain, and if you don't, it probably means you can still advance that domain and create something useful by applying DP to it. Deep Learning is the fastest-developing area of the Machine Learning domain in recent years. At its core, the discipline is about training huge multilayer optimization functions called "neural networks". And the principal approach to doing that, which, practically speaking, has enabled the rapid development of the machine learning techniques that we see today, is the Backpropagation (backprop) optimization algorithm.

As pointed out by Christopher Olah, for modern neural networks, it can make training with gradient descent as much as ten million times faster, relative to a naive implementation. That's the difference between a model taking a week to train and taking 200,000 years. Beyond its use in deep learning, backprop is a computational tool that may be applied in many other areas, ranging from weather forecasting to analyzing numerical stability - it just goes by different names there. In fact, the algorithm has been reinvented dozens of times in different fields. The general, application-independent name for it is Reverse-Mode Differentiation. Essentially, it's a technique for calculating partial derivatives quickly using DP on computational graphs.

Computational graphs are a nice way to think about mathematical expressions. For example, consider the expression (:= e (* (+ a b) (1+ b))). There are four operations: two additions, one multiplication, and an assignment. Let's arrange those computations in the same way they would be performed on the computer:

(let ((c (+ a b))
      (d (1+ b)))
  (:= e (* c d)))

To create a computational graph, we make each of these operations, along with the input variables, into nodes. When the outcome of one expression is an input to another one, a link points from one node to another:

We can evaluate the expression by setting the values in the input nodes (a and b) to certain values and computing nodes in the graph along the dependency paths. For example, let's set a to 2 and b to 1: the result in node e will be, obviously, 6.

The derivatives in a computational graph can be thought of as edge labels. If a directly affects c, then we can write a partial derivative ∂c/∂a along the edge from a to c.

Here is the computational graph with all the derivatives for the evaluation with the values of a and b set to 2 and 1.

But what if we want to understand how nodes that aren't directly connected affect each other? Let's consider how e is affected by a. If we change a at a speed of 1, c also changes at a speed of 1. In turn, c changing at a speed of 1 causes e to change at a speed of 2. So e changes at a rate of (* 1 2) with respect to a. The general rule is to sum over all possible paths from one node to the other, multiplying the derivatives on each edge of the path together. We can see that this graph is, basically, the same kind of graph we used to calculate the shortest path.

This is where Forward-mode differentiation and Reverse-mode differentiation come in. They're algorithms for efficiently computing the sum by factoring the paths. Instead of summing over all of the paths explicitly, they compute the same sum more efficiently by merging paths back together at every node. In fact, both algorithms touch each edge exactly once. Forward-mode differentiation starts at an input to the graph and moves towards the end. At every node, it sums all the paths feeding in. Each of those paths represents one way in which the input affects that node. By adding them up, we get the total derivative. Reverse-mode differentiation, on the other hand, starts at an output of the graph and moves towards the beginning. At each node, it merges all paths which originated at that node. Forward-mode differentiation tracks how one input affects every node. Reverse-mode differentiation tracks how every node affects one output.

So, what if we do reverse-mode differentiation from e down? This gives us the derivative of e with respect to every node. Forward-mode differentiation gave us the derivative of our output with respect to a single input, but reverse-mode differentiation gives us all of the derivatives we need in one go. When training neural networks, the cost is a function of the weights of each edge. And using reverse-mode differentiation (aka backprop), we can calculate the derivatives of the cost with respect to all the weights in a single pass through the graph, and then feed them into gradient descent. As there are millions and even tens of millions of weights in a neural network, reverse-mode differentiation results in a speedup of the same factor!
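The whole example can be traced in a few lines of Python (a hand-unrolled sketch for this particular graph, not a general autodiff implementation):

```python
def reverse_mode_example(a, b):
    # forward pass: evaluate the graph e = (a + b) * (b + 1)
    c = a + b
    d = b + 1
    e = c * d
    # backward pass: start from de/de = 1 and push derivatives along edges
    de_de = 1.0
    de_dc = de_de * d              # e = c * d  =>  de/dc = d
    de_dd = de_de * c              # de/dd = c
    de_da = de_dc * 1              # c = a + b  =>  dc/da = 1
    de_db = de_dc * 1 + de_dd * 1  # b feeds both c and d: sum over the two paths
    return e, de_da, de_db

print(reverse_mode_example(2, 1))  # (6, 2.0, 5.0)
```

A single backward pass produced both partial derivatives; forward mode would have needed one pass per input.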

Backprop is an example of simple memoization DP. No selection of the best variant is needed; it's just a proper arrangement of the operations to avoid redundant computations.

Take-aways

DP-based algorithms may operate on one of these three levels:

If we want to apply DP to some task, we need to find its optimal substructure: i.e., verify that an optimal solution to a subproblem will remain a part of the optimal solution to the whole problem. Next, if we deal with an optimization task, we may have to formulate the recurrence relations. After that, it's just a matter of technique: those relations may be either programmed directly as a recursive or iterative procedure (like in LCS) or indirectly using the method of successive approximations (like in Bellman-Ford).

Ultimately, all DP problems may be reduced to pathfinding in the graph, but it doesn't always make sense to have this graph explicitly as a data structure in the program. If it does, however, remember that Dijkstra's algorithm is the optimal algorithm to find a single shortest path in it.

DP, usually, is a reasonable next thing to think about after the naive greedy approach (which, let's be frank, everyone tends to take initially) stumbles over backtracking. However, we saw that DP and greedy approaches do not contradict each other: in fact, they can be combined, as demonstrated by Dijkstra's algorithm. Yet, an optimal greedy algorithm is more of an exception than a rule. Although, there are a number of problems for which a top-n greedy solution (the so-called Beam search) can be a near-optimal solution that is good enough.

Also, DP doesn't necessarily mean optimal. A vanilla dynamic programming algorithm exhaustively explores the decision space, which may be excessive in many cases. This is demonstrated by the examples of Dijkstra's and Myers' algorithms, which improve on the DP solution by cutting some corners.

P.S. We have also discussed, for the first time in this book, the value of heuristic pre- and postprocessing. From the theoretical standpoint, it is not something you have to pay attention to, but, in practice, that's a very important aspect of the production implementation of many algorithms and, thus, shouldn't be frowned upon or neglected. In an ideal world, an algorithmic procedure should both have optimal worst-case complexity and the fastest operation in the common cases.


[1] If you wonder, s is a word that is usually present in English programmatic dictionaries because when it's and friends are tokenized they're split into two tokens, and the apostrophe may be missing sometimes. Also, our dictionary is case-insensitive.

[2] The intuition for it is the following: in the worst case, every character has two choices: either to be the last letter of the previous word or the first one of the next word, hence the branching factor is 2.

[3] Actually, the condition of complete string coverage may be lifted, which will allow us to use almost the same algorithm but skip over "undictionary" words like misspellings.

[4] A space at the end of the line is discarded.

[5] Provided all the length calculations are implemented efficiently. For simplicity, I have used plain lists here with a linear length complexity, but a separate variable may be added to avoid the extra cost.

[6] However, if we think of it, we could reuse the already proven linked representation just putting the incoming edges into the node structure instead of the outgoing ones.

[7] It runs diff twice: first, at the line-level (using each line as a single unit/token) and then at the character level, as you would normally expect. Then the results are just combined.

13 Dec 2019 10:40am GMT

12 Dec 2019


Joe Marshall: I think you left something out

I recently had the pleasure of working with some auto-generated Java code. All the accessors and mutators were auto-generated, so was the predicate. The only thing missing was the constructor. You could manipulate one of these objects in any way you desired, but there was no easy way to instantiate one in the first place. I eventually figured out a way to instantiate one through reflection. I created a JSON version of the object and deserialized it. Crude, but effective.

I ran into a similar problem where I needed an instance of an object, but some idiot decided to make all the constructors protected. In this case, I created an unwanted subclass with a public constructor, called that instead, and then upcast to the object I really wanted.

Of course I was using these objects in a way the author of the code did not intend they be used, but the author wasn't omniscient and didn't foresee my use case (which was testing). By trying to prevent unintended uses, he simply forced me to write nasty code to get around his ultimately ineffective roadblocks. The author could have spent his time more productively creating a library with a more flexible API to handle unforeseen uses rather than inventing useless impediments to getting the job done.


I still cannot seem to reply to comments. I type a nice reply, hit the "publish" button, and the reply disappears, never to be seen again.

I'm working at a company that requires 100% branch coverage in tests. This requires teasing white-box libraries into returning malformed replies to cover all the branches in the error handler - even the ones that cannot actually be taken because the library doesn't actually generate malformed replies. So I have to supply the malformed replies myself, and it isn't easy if there aren't data constructors. We don't have AspectJ, and the source code isn't always available. And even when it is, it's simply not a good idea to modify it so that it is capable of generating errors it wouldn't otherwise be able to. This leaves me to resort to the hacks.


I was wrong. I just stumbled across the first example. It turns out that the constructor was generated, it just had no arguments. Now this is not unusual in Java (although I consider it bad form. You should be able to construct and initialize an object in one step rather than have to construct a skeleton object and smash in the field values. This reduces your chances of creating partially initialized objects.) What made this a problem is that the generated code had no mutators, so you could only create skeleton objects, you couldn't put any values in them. I did remember correctly that I deserialized a JSON version to create an object with actual values in it. This seemed a better option than directly using reflection because the JSON object "looked" like the resulting object I wanted.

I'm no fan of just making every field mutable by default - just the opposite, in fact. Objects should usually be immutable. But you ought to have a way to create an initialized object without resorting to reflection.

12 Dec 2019 5:36pm GMT

10 Dec 2019


Joe Marshall: (map append elements)

I can recall the difficulty I had understanding let expressions in Lisp. I used to write out the equivalent lambda expression and then manually add the syntactic sugar to make it into a let. And I don't recall having an easy time with the chapter on linked lists. So I find it amusing to see what tortuous code students come up with when they are first introduced to linked lists. My favorite so far has been (map append elements), which looks like it might join the sublists in elements, but if you think about it, the arity is all wrong: map doesn't apply its function, it calls it. This had me scratching my head for several minutes until I realized you can call append on a single argument.

So what little gems of code have you discovered that perform the most trivial of tasks in the most convoluted or obscure manner?

10 Dec 2019 4:56pm GMT

06 Dec 2019


Joe Marshall: "No 'x' in 'Nixon'"

I just had a phone screen where they asked if I could write a palindrome detector. I wrote a recursive and an iterative version. I like the recursive solution because it is so simple. There are only three equations: the empty string is a palindrome, a one-character string is a palindrome, and a longer string is a palindrome if its first and last characters are equal and the substring between them is a palindrome.

I wrote them the iterative version so I wouldn't look like an idiot. The iterative version is generally faster with a standard compiler, but it is more complex. Not much more complex, but still, you have to keep track of two indexes into the string and how close they've gotten to each other.

The recursive solution is just more elegant.
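In Python, the two versions might look like this (a sketch: both check the string literally, so the title phrase only works after lowercasing and stripping spaces and punctuation, e.g. "noxinnixon"):

```python
def palindrome_rec(s):
    # the three equations: the empty string and a one-character string are
    # palindromes; otherwise, compare the ends and recurse on the middle
    if len(s) < 2:
        return True
    return s[0] == s[-1] and palindrome_rec(s[1:-1])

def palindrome_iter(s):
    # two indexes closing in on each other from both ends
    lo, hi = 0, len(s) - 1
    while lo < hi:
        if s[lo] != s[hi]:
            return False
        lo += 1
        hi -= 1
    return True
```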


Apparently I cannot post comments on my own blog. Well, I can edit the post. "if s[pos] != s[-pos-1]" sure looks like two pointers, but to be fair, the recursive solution hides its pointers in the string structure or in the slices if it is using them. As for "(string= x (string-reverse x))", there is something elegant in something that crude.

06 Dec 2019 9:44pm GMT

05 Dec 2019


Zach Beane: Working with letsencrypt's certbot for a Lisp webserver

Every 90 days my letsencrypt certificate expires and I renew it manually. I have to cut and paste data from the certbot script into repl forms to serve the right data on the right path and it's such a pain that sometimes I put it off until after it expires, and people email me about it, and I feel bad.

The manual process looks something like this (not my actual domain or challenges):

# certbot certonly --manual -d my.example.com

Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
http-01 challenge for my.example.com

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Create a file containing just this data:

uaut.gj61B3f7oYQcWZSF4kxS4OFh8KQlsDVtPXrw60Tkj2JLW7RtZaVE0MIWwiEKRlxph7SaLwp1ETFjaGDUKN

And make it available on your web server at this URL:

http://my.example.com/.well-known/acme-challenge/SpZa7sf4QMEFoF7lKh7aT4EZjNWSVcur2jODPQkgExa

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

There's no way that letsencrypt's certbot program will automatically get support for my custom hunchentoot "framework" and I don't really want to look into adding it myself.

For a while I thought about writing some Expect functionality in SBCL - sb-ext:run-program already supports a :pty option so it wouldn't be super-difficult. With a theoretical cl-expect you could spawn certbot with the right options, slurp out the verification secret and URL via the process's standard output stream, and call whatever code you need to serve the secret at that location.

I realized today there's a halfway solution that takes the worst cutting and pasting out of the loop. The unix command script -flush starts an interactive session where all input and output is saved to a file. My Lisp webserver can then read certbot program output from that file and configure the webserver automagically for me. I still have to manually start certbot and enter a few REPL commands, but it's easier than before.

Here's the core of the Lisp side of things for processing the script file:

(defun scrape-letsencrypt-interaction (file)
  (let (secret path)
    (with-open-file (stream file)
      (labels (...)
        (loop
          (skip-past "Create a file containing")
          (next-line)
          (setf secret (read-trimmed-line))
          (skip-past "make it available")
          (next-line)
          (setf path (substring-after ".well-known" (read-trimmed-line))))))))

This could be done just as well with cl-ppcre, perhaps, but I didn't want to pull it in as a dependency.

Here are the functions from labels:

((next-line ()
   (let ((line (read-line stream nil)))
     (when (null line)
       (unless (and secret path)
     (error "Didn't find a secret and path anywhere in ~S"
            file))
       (return-from scrape-letsencrypt-interaction
         (values secret path)))
     line))
 (skip-past (string)
   (loop
     (let ((line (next-line)))
       (when (search string line)
         (return)))))
 (read-trimmed-line ()
   (string-trim '(#\Return)
                (next-line)))
 (substring-after (string target)
   (let ((pos (search string target)))
     (unless pos
       (error "Could not find ~S in ~S" string target))
     (subseq target pos))))

The goal here is to only look at the last secret and URL in the script output, so the main loop keeps track of what it's seen so far and next-line returns those values at EOF. The output also has ugly trailing ^M noise so read-trimmed-line takes care of that.

Here's the whole thing all together:

(defun scrape-letsencrypt-interaction (file)
  (let (secret path)
    (with-open-file (stream file)
      (labels ((next-line ()
                 (let ((line (read-line stream nil)))
                   (when (null line)
                     (unless (and secret path)
                       (error "Didn't find a secret and path anywhere in ~S"
                              file))
                     (return-from scrape-letsencrypt-interaction
                       (values secret path)))
                   line))
               (skip-past (string)
                 (loop
                   (let ((line (next-line)))
                     (when (search string line)
                       (return)))))
               (read-trimmed-line ()
                 (string-trim '(#\Return)
                              (next-line)))
               (substring-after (string target)
                 (let ((pos (search string target)))
                   (unless pos
                     (error "Could not find ~S in ~S" string target))
                   (subseq target pos))))
        (loop
          (skip-past "Create a file containing")
          (next-line)
          (setf secret (read-trimmed-line))
          (skip-past "make it available")
          (next-line)
          (setf path (substring-after ".well-known" (read-trimmed-line))))))))

I don't mind shaving only half a yak when it feels like useful progress!

Someday I'll get around to a proper Expect-like thing…

05 Dec 2019 11:35pm GMT

02 Dec 2019


Wimpie Nortje: HTTP Routing libraries for Hunchentoot

Index of picking libraries blog series

Hunchentoot is a web server and not a heavyweight web development framework. It provides methods to implement URL paths for static files, directory paths, and regexes, very much like the mainstream web servers. It does not provide an easy way to define routes for a REST API. When you use Hunchentoot without a web development framework, you will probably want something to make route definition easier.

There are a few options for building REST APIs available in frameworks, Hunchentoot derivatives or other web servers but I wanted to implement REST routes with the original Hunchentoot web server. I found three libraries that can do this: simple-routes, Froute and easy-routes.

Simple-routes is the simplest and easiest to use. Routes are defined in a list similar to Hunchentoot's *dispatch-table*. It supports variables in the URL segments but there is no support for middleware[1] type functionality.

Froute is the most powerful of the three. It is based on CLOS and is designed so it can be used with any web server although only a Hunchentoot connector is currently implemented. Routes are defined as CLOS classes and even though middleware is not a specific feature the documentation gives an example on how to use class inheritance to implement such functionality. The power of being CLOS based also makes this library the most complex to use.

Easy-routes has the concept of decorators which are functions that execute before the route body so it can be used to implement middleware functionality. Unlike Clack's middleware which are defined at a central place for all routes, decorators need to be applied to each route handler individually. It's not quite there, but close enough.

The lack of middleware options disqualified simple-routes for me and Froute looked like it provides everything I need, and more, but with much greater complexity than easy-routes. I decided to use easy-routes with the option to switch to Froute when I needed the extra capability.

Hunchentoot takes an "acceptor" argument at startup. Easy-routes provides two options: easy-routes-acceptor and routes-acceptor. Easy-routes-acceptor first executes all the route handlers and if no suitable handler is found it executes the normal Hunchentoot request handlers. The routes-acceptor executes only the route handlers and returns an 404 NOT FOUND error if no suitable handler is found.

I use routes-acceptor because it ensures that only requests with explicitly defined handlers are handled. With the easy-routes-acceptor it is too easy to create a security hole with some default Hunchentoot request handler that catches non-existent routes. It can be burdensome to use this approach for handling static files but I run Hunchentoot behind Nginx which also handles the static files.

The table summarises the main features I investigated:

Feature                       | simple-routes                      | easy-routes                                | Froute
Web server                    | Hunchentoot                        | Hunchentoot                                | Hunchentoot (can be expanded to others)
REST routes                   | Yes                                | Yes                                        | Yes
Argument extraction from URL  | Yes                                | Yes                                        | Yes
Dispatch based on HTTP method | Yes                                | Yes                                        | Yes
Middleware                    | No                                 | Decorators                                 | CLOS inheritance
Fallback for undefined routes | Hunchentoot easy-handler framework | None or Hunchentoot easy-handler framework | None
Learning curve                | Negligible                         | Minimal                                    | Medium (requires some CLOS knowledge)
Index of picking libraries blog series
  1. Middleware are functions that run before or after the main request handler. They are called as part of the request handling process and not by the handler. This makes them ideal to handle general functions like setting up a database connection, performing authorisation or any task which is not part of a particular request handler.

02 Dec 2019 12:00am GMT

01 Dec 2019


Paul Khuong: A Multiset of Observations With Constant-time Sample Mean and Variance

Fixed notation issues in the "Faster multiset updates" section. Thank you Joonas.

Let's say you have a multiset (bag) of "reals" (floats or rationals), where each value is a sampled observations. It's easy to augment any implementation of the multiset ADT to also return the sample mean of the values in the multiset in constant time: track the sum of values in the multiset, as they are individually added and removed. This requires one accumulator and a counter for the number of observations in the multiset (i.e., constant space), and adds a constant time overhead to each update.

It's not as simple when you also need the sample variance of the multiset \(X\), i.e.,

\[\frac{1}{n - 1} \sum\sb{x \in X} (x - \hat{x})\sp{2},\]

where \(n = |X|\) is the sample size and \(\hat{x}\) is the sample mean \(\sum\sb{x\in X} x/n,\) ideally with constant query time, and constant space and update time overhead.

One could try to apply the textbook equality

\[s\sp{2} = \frac{1}{n(n-1)}\left[n\sum\sb{x\in X} x\sp{2} - \left(\sum\sb{x\in X} x\right)\sp{2}\right].\]

However, as Knuth notes in TAoCP volume 2, this expression loses a lot of precision to round-off in floating point: in extreme cases, the difference might be negative (and we know the variance is never negative). More commonly, we'll lose precision when the sampled values are clustered around a large mean. For example, the sample standard deviation of 1e8 and 1e8 - 1 is the same as that of 0 and 1. However, the expression above would evaluate it to 0.0, even in double precision: while 1e8 is comfortably within range for double floats, its square 1e16 is outside the range where all integers are represented exactly.
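The cancellation is easy to reproduce in Python (a small demonstration; the helper names are made up for the illustration):

```python
def textbook_variance(xs):
    # the textbook one-pass formula: n * sum(x^2) - (sum(x))^2, all in doubles
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1))

def two_pass_variance(xs):
    # the numerically safe two-pass version: subtract the mean before squaring
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

print(textbook_variance([0.0, 1.0]))      # 0.5: fine for small values
print(two_pass_variance([1e8, 1e8 - 1]))  # 0.5: centering preserves precision
print(textbook_variance([1e8, 1e8 - 1]))  # 0.0: the squares exceed 2^53
```

Both pairs have the same true sample variance, 0.5, but for the pair near 1e8 the textbook formula cancels away every significant digit.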

Knuth refers to a better behaved recurrence by Welford, where a running sample mean is subtracted from each new observation before squaring. John Cook has a C++ implementation of the recurrence that adds observations to a sample variance in constant time. In Python, this streaming algorithm looks like this.

streaming_variance.py

class StreamingVariance:
    def __init__(self):
        self.n = 0
        self.mean = 0
        self.var_sum = 0  # centered 2nd moment (~variance of the sum of observations)

    def observe(self, v):
        self.n += 1
        if self.n == 1:
            self.mean = v
            return
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def get_mean(self):
        return self.mean

    def get_variance(self):
        return self.var_sum / (self.n - 1) if self.n > 1 else 0

That's all we need for insert-only multisets, but it does not handle removals; if only we had removals, we could implement updates (replacement) as a removal followed by an insertion.

Luckily, StreamingVariance.observe looks invertible. It shouldn't be hard to recover the previous sample mean, given v; and, given the current and previous sample means, we can re-evaluate (v - old_mean) * (v - self.mean) and subtract it from self.var_sum.

Let \(\hat{x}\sp{\prime}\) be the sample mean after observe(v). We can derive the previous sample mean \(\hat{x}\) from \(v\):

\[(n - 1)\hat{x} = n\hat{x}\sp{\prime} - v \Leftrightarrow \hat{x} = \hat{x}\sp{\prime} + \frac{\hat{x}\sp{\prime} - v}{n-1}.\]

This invertibility means that we can undo calls to observe in LIFO order. We can't handle arbitrary multiset updates, only a stack of observations. That's still better than nothing.
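As a quick numeric sanity check on that inversion (my own example, chosen so every intermediate value is exact in floats):

```python
# Quick check of the mean-inversion step: given the mean after
# observing v, recover the mean before v was added.
obs = [2.0, 8.0, 2.0]
n = len(obs)
v = obs[-1]                   # the last observation pushed
mean_after = sum(obs) / n     # mean including v
mean_before = mean_after + (mean_after - v) / (n - 1)
assert mean_before == sum(obs[:-1]) / (n - 1)  # 5.0 both ways
```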

variance_stack.py

class VarianceStack:
    def __init__(self):
        self.n = 0
        self.mean = 0
        self.var_sum = 0  # variance of the sum

    def push(self, v):
        self.n += 1
        if self.n == 1:
            self.mean = v
            return
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def pop(self, v):
        assert self.n > 0
        if self.n == 1:
            self.n = 0
            self.mean = 0
            self.var_sum = 0
            return
        next_n = self.n - 1
        old_mean = self.mean
        self.mean = old_mean + (old_mean - v) / next_n
        # var_sum should never be negative, clamp it so.
        self.var_sum = max(0, self.var_sum - (v - self.mean) * (v - old_mean))
        self.n -= 1

    def get_mean(self):
        return self.mean

    def get_variance(self):
        return self.var_sum / (self.n - 1) if self.n > 1 else 0

Before going any further, let's test this.

Testing the VarianceStack

The best way to test the VarianceStack is to execute a series of push and pop calls, and compare the results of get_mean and get_variance with batch reference implementations.

I could hardcode calls in unit tests. However, that quickly hits diminishing returns in terms of marginal coverage vs. developer time. Instead, I'll be lazy, completely skip unit tests, and rely on Hypothesis, its high-level "stateful" testing API in particular.

We'll keep track of the values pushed and popped off the observation stack in the driver: we must make sure they're matched in LIFO order, and we need the stack's contents to compute the reference mean and variance. We'll also want to compare the results with reference implementations, modulo some numerical noise. Let's try to be aggressive and bound the number of representable floats between the reference and the actual results.

variance_stack_driver.py

import struct
import unittest
import hypothesis.strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, precondition, rule


def float_bits(x: float) -> int:
    bits = struct.unpack('=q', struct.pack('=d', x))[0]
    significand = bits % (1 << 63)
    # ~significand = -1 - significand. We need that instead of just
    # -significand to handle signed zeros.
    return significand if bits >= 0 else ~significand


FLOAT_DISTANCE = 2**10


def assert_almost_equal(x, y, max_delta=FLOAT_DISTANCE):
    assert abs(float_bits(x) - float_bits(y)) <= max_delta


class VarianceStackDriver(RuleBasedStateMachine):
    def __init__(self):
        super(VarianceStackDriver, self).__init__()
        self.values = []

    @rule(v=st.floats(allow_nan=False, allow_infinity=False))
    def push(self, v):
        self.values.append(v)

    # Don't generate `pop()` calls when the stack is empty.
    @precondition(lambda self: self.values)
    @rule()
    def pop(self):
        self.values.pop()

    def reference_mean(self):
        if self.values:
            return sum(self.values) / len(self.values)
        return 0

    def reference_variance(self):
        n = len(self.values)
        if n <= 1:
            return 0
        mean = self.reference_mean()
        return sum(pow(x - mean, 2) for x in self.values) / n

    @invariant()
    def mean_matches(self):
        assert_almost_equal(self.reference_mean(), self.reference_mean())

    @invariant()
    def variance_matches(self):
        assert_almost_equal(self.reference_variance(),
                            self.reference_variance())


StackTest = VarianceStackDriver.TestCase

if __name__ == '__main__':
    unittest.main()

This initial driver does not even use the VarianceStack yet. All it does is push values to the reference stack, pop values when the stack has something to pop, and check that the reference implementations match themselves after each call: I want to first shake out any bug in the test harness itself.

Not surprisingly, Hypothesis does find an issue in the reference implementation:

Falsifying example:
state = VarianceStackDriver()
state.push(v=0.0)
state.push(v=2.6815615859885194e+154)
state.teardown()

We get a numerical OverflowError in reference_variance: 2.68...e154 / 2 is slightly greater than sqrt(sys.float_info.max) = 1.3407807929942596e+154, so taking the square of that value errors out instead of returning infinity.

Let's start by clamping the range of the generated floats.

variance_stack_driver.py

import math
import sys
...


MAX_RANGE = math.sqrt(sys.float_info.max) / 2

FLOAT_STRATEGY = st.floats(min_value=-MAX_RANGE, max_value=MAX_RANGE)

class VarianceStackDriver(RuleBasedStateMachine):
    ...

    @rule(v=FLOAT_STRATEGY)
    def push(self, v):
        self.variance_stack.push(v)
        self.values.append(v)

    ...

Now that the test harness doesn't find fault in itself, let's hook in the VarianceStack, and see what happens when only push calls are generated (i.e., first test only the standard streaming variance algorithm).

variance_stack_driver.py

def assert_almost_equal(x, y, max_delta=FLOAT_DISTANCE):
    distance = abs(float_bits(x) - float_bits(y))
    # Print out some useful information on failure.
    assert distance <= max_delta, '%.18g != %.18g (%f)' % (x, y, math.log(distance, 2))

class VarianceStackDriver(RuleBasedStateMachine):
    def __init__(self):
        super(VarianceStackDriver, self).__init__()
        self.values = []
        self.variance_stack = VarianceStack()

    @rule(v=FLOAT_STRATEGY)
    def push(self, v):
        self.variance_stack.push(v)
        self.values.append(v)

    # Never generate `pop()`
    @precondition(lambda self: self.values and False)
    @rule()
    def pop(self):
        self.values.pop()

    def reference_mean(self):
        if self.values:
            return sum(self.values) / len(self.values)
        return 0

    def reference_variance(self):
        n = len(self.values)
        if n <= 1:
            return 0
        mean = self.reference_mean()
        return sum(pow(x - mean, 2) for x in self.values) / n

    @invariant()
    def mean_matches(self):
        assert_almost_equal(self.reference_mean(),
                            self.variance_stack.get_mean())

    @invariant()
    def variance_matches(self):
        assert_almost_equal(self.reference_variance(),
                            self.variance_stack.get_variance())

This already fails horribly.

Falsifying example:
state = VarianceStackDriver()
state.push(v=1.0)
state.push(v=1.488565707357403e+138)
state.teardown()
F

The reference finds a variance of 5.54e275, which is very much not the streaming computation's 1.108e276. We can manually check that the reference is wrong: it's missing the n - 1 correction term in the denominator.

We should use this updated reference.

variance_stack_driver.py

class VarianceStackDriver(RuleBasedStateMachine):
    ...

    def reference_variance(self):
        n = len(self.values)
        if n <= 1:
            return 0
        mean = self.reference_mean()
        return sum(pow(x - mean, 2) for x in self.values) / (n - 1)

Let's now re-enable calls to pop().

variance_stack_driver.py

class VarianceStackDriver(RuleBasedStateMachine):
    ...

    @precondition(lambda self: self.values)
    @rule()
    def pop(self):
        self.variance_stack.pop(self.values[-1])
        self.values.pop()

And now things fail in new and excitingly numerical ways.

Falsifying example:
state = VarianceStackDriver()
state.push(v=0.0)
state.push(v=0.00014142319560050964)
state.push(v=14188.9609375)
state.pop()
state.teardown()
F

This counter-example fails with the online variance returning 0.0 instead of 1e-8. That's not unexpected: removing (the square of) a large value from a running sum spells catastrophic cancellation. It's also not that bad for my use case, where I don't expect to observe very large values.

Another problem for our test harness is that floats are very dense around 0.0, and I'm ok with small (around 1e-8) absolute error because the input and output will be single floats.

Let's relax assert_almost_equal, and restrict generated observations to fall in \([-2\sp{12}, 2\sp{12}].\)

variance_stack_driver.py

# Let values be off by ~1 single float ULP
FLOAT_DISTANCE = 2**32

# or by 1e-8
ABSOLUTE_EPS = 1e-8


def assert_almost_equal(x, y, max_delta=FLOAT_DISTANCE, abs_eps=ABSOLUTE_EPS):
    delta = abs(x - y)
    distance = abs(float_bits(x) - float_bits(y))
    assert distance <= max_delta or delta <= abs_eps, '%.18g != %.18g (%f)' % (
        x, y, math.log(distance, 2))


# Avoid generating very large observations.
MAX_RANGE = 2**12

FLOAT_STRATEGY = st.floats(width=32, min_value=-MAX_RANGE, max_value=MAX_RANGE)

With all these tweaks to make sure we generate easy (i.e., interesting) test cases, Hypothesis fails to find a failure within its default time budget.

I'm willing to call that a victory.

From stack to full multiset

We have tested code to undo updates in Welford's classic streaming variance algorithm. Unfortunately, inverting pushes away only works for LIFO edits, and we're looking for arbitrary inserts and removals (and updates) to a multiset of observations.

However, both the mean \(\hat{x} = \sum\sb{x\in X} x/n\) and the centered second moment \(\sum\sb{x\in X}(x - \hat{x})\sp{2}\) are order-independent: they're just sums over all observations. Disregarding round-off, we'll find the same mean and second moment regardless of the order in which the observations were pushed in. Thus, whenever we wish to remove an observation from the multiset, we can assume it was the last one added to the estimates, and pop it off.
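Here's that claim in action, as a condensed, self-contained sketch of the VarianceStack above (same push/pop logic; my own example, with float-exact values): we pop an observation that was *not* pushed last and still land on the exact batch statistics.

```python
# Because mean and var_sum are order-independent sums, popping any
# observation "as if it were the last one pushed" yields the same
# estimates as never having observed it.
import statistics


class VarianceStack:
    def __init__(self):
        self.n, self.mean, self.var_sum = 0, 0.0, 0.0

    def push(self, v):
        self.n += 1
        if self.n == 1:
            self.mean = v
            return
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def pop(self, v):
        assert self.n > 0
        if self.n == 1:
            self.n, self.mean, self.var_sum = 0, 0.0, 0.0
            return
        next_n = self.n - 1
        old_mean = self.mean
        # Recover the mean before v was (notionally) pushed last.
        self.mean = old_mean + (old_mean - v) / next_n
        self.var_sum = max(0, self.var_sum - (v - self.mean) * (v - old_mean))
        self.n = next_n


s = VarianceStack()
for v in [1.0, 4.0, 7.0, 10.0]:
    s.push(v)
s.pop(4.0)  # 4.0 was the second value pushed, not the last

assert s.mean == statistics.mean([1.0, 7.0, 10.0])                      # 6.0
assert s.var_sum / (s.n - 1) == statistics.variance([1.0, 7.0, 10.0])   # 21.0
```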

We think we know how to implement running mean and variance for a multiset of observations. How do we test that with Hypothesis?

The hardest part about testing dictionary (map)-like interfaces is making sure to generate valid identifiers when removing values. As it turns out, Hypothesis has built-in support for this important use case, with its Bundles. We'll use that to test a dictionary from observation name to observation value, augmented to keep track of the current mean and variance of all values.

variance_multiset_driver.py

from hypothesis.stateful import Bundle, consumes, multiple  # in addition to the earlier imports


class VarianceBag(VarianceStack):
    def update(self, old, new):
        # Replace one instance of `old` with `new` by
        # removing `old` and inserting `new`.
        self.pop(old)
        self.push(new)


class VarianceBagDriver(RuleBasedStateMachine):
    keys = Bundle("keys")

    def __init__(self):
        super(VarianceBagDriver, self).__init__()
        self.entries = dict()
        self.variance_bag = VarianceBag()

    @rule(target=keys, k=st.binary(), v=FLOAT_STRATEGY)
    def add_entry(self, k, v):
        if k in self.entries:
            self.update_entry(k, v)
            return multiple()

        self.entries[k] = v
        self.variance_bag.push(v)
        return k

    @rule(k=consumes(keys))
    def del_entry(self, k):
        self.variance_bag.pop(self.entries[k])
        del self.entries[k]

    @rule(k=keys, v=FLOAT_STRATEGY)
    def update_entry(self, k, v):
        self.variance_bag.update(self.entries[k], v)
        self.entries[k] = v

    def reference_mean(self):
        if self.entries:
            return sum(self.entries.values()) / len(self.entries)
        return 0

    def reference_variance(self):
        n = len(self.entries)
        if n <= 1:
            return 0
        mean = self.reference_mean()
        return sum(pow(x - mean, 2) for x in self.entries.values()) / (n - 1)

    @invariant()
    def mean_matches(self):
        assert_almost_equal(self.reference_mean(),
                            self.variance_bag.get_mean())

    @invariant()
    def variance_matches(self):
        assert_almost_equal(self.reference_variance(),
                            self.variance_bag.get_variance())


BagTest = VarianceBagDriver.TestCase

Each call to add_entry will either go to update_entry if the key already exists, or add an observation to the dictionary and streaming estimator. If we have a new key, it is added to the keys Bundle; calls to del_entry and update_entry draw keys from this Bundle. When we remove an entry, it's also consumed from the keys Bundle.

Hypothesis finds no fault with our new implementation of dictionary-with-variance, but update seems like it could be made much faster and more numerically stable, and I intend to mostly use this data structure for calls to update.

Faster multiset updates

The key operation for my use-case is to update one observation by replacing its old value with a new one. We can maintain the estimator by popping old away and pushing new in, but this business with updating the number of observations n and rescaling everything seems like a lot of numerical trouble.

We should be able to do better.

We're replacing the multiset of sampled observations \(X\) with \(X\sp{\prime} = X \setminus \{\textrm{old}\} \cup \{\textrm{new}\}.\) It's easy to maintain the mean after this update: \(\hat{x}\sp{\prime} = \hat{x} + (\textrm{new} - \textrm{old})/n.\)

The update to self.var_sum, the sum of squared differences from the mean, is trickier. We start with \(v = \sum\sb{x\in X} (x - \hat{x})\sp{2},\) and we wish to find \(v\sp{\prime} = \sum\sb{x\sp{\prime}\in X\sp{\prime}} (x\sp{\prime} - \hat{x}\sp{\prime})\sp{2}.\)

Let \(\delta = \textrm{new} - \textrm{old}\) and \(\delta\sb{\hat{x}} = \delta/n.\) We have \[\sum\sb{x\in X} (x - \hat{x}\sp{\prime})\sp{2} = \sum\sb{x\in X} [(x - \hat{x}) - \delta\sb{\hat{x}}]\sp{2},\] and \[[(x - \hat{x}) - \delta\sb{\hat{x}}]\sp{2} = (x - \hat{x})\sp{2} - 2\delta\sb{\hat{x}} (x - \hat{x}) + \delta\sb{\hat{x}}\sp{2}.\]

We can reassociate the sum, and find

\[\sum\sb{x\in X} (x - \hat{x}\sp{\prime})\sp{2} = \sum\sb{x\in X} (x - \hat{x})\sp{2} - 2\delta\sb{\hat{x}} \sum\sb{x \in X} (x - \hat{x}) + n \delta\sb{\hat{x}}\sp{2}\]

Once we notice that \(\hat{x} = \sum\sb{x\in X} x/n,\) it's clear that the middle term sums to zero, and we find the very reasonable

\[v\sb{\hat{x}\sp{\prime}} = \sum\sb{x\in X} (x - \hat{x})\sp{2} + n \delta\sb{\hat{x}}\sp{2} = v + \delta \delta\sb{\hat{x}}.\]

This new accumulator \(v\sb{\hat{x}\sp{\prime}}\) corresponds to the sum of the squared differences between the old observations \(X\) and the new mean \(\hat{x}\sp{\prime}\). We still have to update one observation from old to new. The remaining adjustment to \(v\) (self.var_sum) corresponds to going from \((\textrm{old} - \hat{x}\sp{\prime})\sp{2}\) to \((\textrm{new} - \hat{x}\sp{\prime})\sp{2},\) where \(\textrm{new} = \textrm{old} + \delta.\)

After a bit of algebra, we get \[(\textrm{new} - \hat{x}\sp{\prime})\sp{2} = [(\textrm{old} - \hat{x}\sp{\prime}) + \delta]\sp{2} = (\textrm{old} - \hat{x}\sp{\prime})\sp{2} + \delta (\textrm{old} - \hat{x}\sp{\prime} + \textrm{new} - \hat{x}\sp{\prime}).\]

The adjusted \(v\sb{\hat{x}\sp{\prime}}\) already includes \((\textrm{old} - \hat{x}\sp{\prime})\sp{2}\) in its sum, so we only have to add the last term to obtain the final updated self.var_sum

\[v\sp{\prime} = v\sb{\hat{x}\sp{\prime}} + \delta (\textrm{old} - \hat{x}\sp{\prime} + \textrm{new} - \hat{x}\sp{\prime}) = v + \delta [2 (\textrm{old} - \hat{x}) + \delta - \delta\sb{\hat{x}}].\]

That's our final implementation for VarianceBag.update, for which Hypothesis also fails to find failures.

VarianceBag.py

class VarianceBag(VarianceStack):
    def update(self, old, new):
        assert self.n > 0
        if self.n == 1:
            self.mean = new
            self.var_sum = 0
            return
        delta = new - old
        old_mean = self.mean
        delta_mean = delta / self.n
        self.mean += delta_mean

        adjustment = delta * (2 * (old - old_mean) + (delta - delta_mean))
        self.var_sum = max(0, self.var_sum + adjustment)
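To see the update in action (my own example, with the classes condensed into one self-contained sketch and float-exact values), replacing one observation matches a batch recomputation exactly:

```python
# Condensed sketch of the constant-time update: maintain mean and
# var_sum under push, then replace one observation in place.
import statistics


class VarianceBag:
    def __init__(self):
        self.n, self.mean, self.var_sum = 0, 0.0, 0.0

    def push(self, v):
        self.n += 1
        if self.n == 1:
            self.mean = v
            return
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def update(self, old, new):
        assert self.n > 0
        if self.n == 1:
            self.mean, self.var_sum = new, 0.0
            return
        delta = new - old
        old_mean = self.mean
        delta_mean = delta / self.n
        self.mean += delta_mean
        # v' = v + delta * [2 (old - mean) + delta - delta_mean]
        adjustment = delta * (2 * (old - old_mean) + (delta - delta_mean))
        self.var_sum = max(0, self.var_sum + adjustment)


bag = VarianceBag()
for v in [1.0, 4.0, 7.0, 10.0]:
    bag.push(v)
bag.update(4.0, 16.0)  # the multiset becomes {1, 7, 10, 16}

assert bag.mean == statistics.mean([1.0, 7.0, 10.0, 16.0])                       # 8.5
assert bag.var_sum / (bag.n - 1) == statistics.variance([1.0, 7.0, 10.0, 16.0])  # 39.0
```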

How much do you trust testing?

We have automated property-based tests and some human-checked proofs. Ship it?

I was initially going to ask a CAS to check my reformulations, but the implicit \(\forall\) looked messy. Instead, I decided to check the induction hypothesis implicit in VarianceBag.update, and enumerate all cases up to a certain number of values with Z3 in IPython.

In [1]: from z3 import *
In [2]: x, y, z, new_x = Reals("x y z new_x")
In [3]: mean = (x + y + z) / 3
In [4]: var_sum = sum((v - mean) * (v - mean) for v in (x, y, z))
In [5]: delta = new_x - x
In [6]: new_mean = mean + delta / 3
In [7]: delta_mean = delta / 3
In [8]: adjustment = delta * (2 * (x - mean) + (delta - delta_mean))
In [9]: new_var_sum = var_sum + adjustment

# We have our expressions. Let's check equivalence for mean, then var_sum
In [10]: s = Solver() 
In [11]: s.push()
In [12]: s.add(new_mean != (new_x + y + z) / 3)
In [13]: s.check()
Out[13]: unsat  # No counter example of size 3 for the updated mean
In [14]: s.pop()

In [15]: s.push()
In [16]: s.add(new_mean == (new_x + y + z) / 3)  # We know the mean matches
In [17]: s.add(new_var_sum != sum((v - new_mean) * (v - new_mean) for v in (new_x, y, z)))
In [18]: s.check()
Out[18]: unsat  # No counter example of size 3 for the updated variance

Given this script, it's a small matter of programming to generalise from 3 values (x, y, and z) to any fixed number of values, and generate all small cases up to, e.g., 10 values.

z3-check.py

from z3 import *


def updated_expressions(vars, new_x):
    x = vars[0]
    num_var = len(vars)
    mean = sum(vars) / num_var
    var_sum = sum((v - mean) * (v - mean) for v in vars)
    delta = new_x - x
    delta_mean = delta / num_var
    new_mean = mean + delta_mean
    adjustment = delta * (2 * (x - mean) + (delta - delta_mean))
    new_var_sum = var_sum + adjustment
    return new_mean, new_var_sum


def test_num_var(num_var):
    assert num_var > 0
    vars = [Real('x_%i' % i) for i in range(0, num_var)]
    new_x = Real('new_x')

    new_mean, new_var_sum = updated_expressions(vars, new_x)
    new_vars = [new_x] + vars[1:]
    s = Solver()
    s.push()
    s.add(new_mean != sum(new_vars) / num_var)
    result = s.check()
    print('updated mean %s' % result)
    if result != unsat:
        print(s.model())
        return False
    s.pop()

    s.push()
    s.add(new_mean == sum(new_vars) / num_var)
    s.add(new_var_sum != sum(
        (v - new_mean) * (v - new_mean) for v in new_vars))
    result = s.check()
    print('updated variance %s' % result)
    if result != unsat:
        print(s.model())
        return False
    return True


for i in range(1, 11):
    print('testing n=%i' % i)
    if test_num_var(i):
        print('OK')
    else:
        print('FAIL %i' % i)
        break

I find the most important thing when it comes to using automated proofs is to insert errors and confirm we can find the bugs we're looking for.

I did that by manually mutating the expressions for new_mean and new_var_sum in updated_expressions. This let me find a simple bug in the initial implementation of test_num_var: I used if not result instead of result != unsat, and both sat and unsat are truthy. The code initially failed to flag a failure when z3 found a counter-example for our correctness condition!

And now I'm satisfied

I have code to augment an arbitrary multiset or dictionary with a running estimate of the mean and variance; that code is based on a classic recurrence, with some new math checked by hand, with automated tests, and with some exhaustive checking of small inputs (to which I claim most bugs can be reduced).

I'm now pretty sure the code works, but there's another more obviously correct way to solve that update problem. This 2008 report by Philippe Pébay1 presents formulas to compute the mean, variance, and arbitrary moments in one pass, and shows how to combine accumulators, a useful operation in parallel computing.

We could use these formulas to augment an arbitrary \(k\)-ary tree and re-combine the merged accumulator as we go back up the (search) tree from the modified leaf to the root. The update would be much more stable (we only add and merge observations), and incur logarithmic time overhead (with linear space overhead). However, given the same time budget, and a logarithmic space overhead, we could also implement the constant-time update with arbitrary precision software floats, and probably guarantee even better precision.
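For reference, the pairwise merge such an augmented tree would use can be sketched as follows (my own code, following the combining rule in Pébay's report, often attributed to Chan et al.):

```python
# Merge two (count, mean, var_sum) accumulators, as an augmented tree
# node would combine its children's summaries. var_sum is the sum of
# squared deviations from the mean (n - 1 times the sample variance).
def combine(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


# {1, 2} has mean 1.5, var_sum 0.5; {3, 4, 5} has mean 4, var_sum 2;
# merged, they match the batch statistics of {1, 2, 3, 4, 5}.
assert combine(2, 1.5, 0.5, 3, 4.0, 2.0) == (5, 3.0, 10.0)
```

Unlike the constant-time in-place update, this merge only ever adds non-negative second-moment terms, which is what makes the tree variant more numerically robust.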

The constant-time update I described in this post demanded more effort to convince myself of its correctness, but I think it's always a better option than an augmented tree for serial code, especially if initial values are available to populate the accumulators with batch-computed mean and variance. I'm pretty sure the code works, and it's up in this gist. I'll be re-implementing it in C++ because that's the language used by the project that led me to this problem; feel free to steal that gist.


  1. There's also a 2016 journal article by Pébay and others with numerical experiments, but I failed to implement their simpler-looking scalar update...

01 Dec 2019 4:51am GMT

30 Nov 2019

Planet Lisp

Quicklisp news: November 2019 Quicklisp dist update now available

New projects:

Updated projects: april, assert-p, assertion-error, async-process, babel, bordeaux-threads, bp, ceramic, chameleon, chanl, cl+ssl, cl-async, cl-bnf, cl-collider, cl-dct, cl-editdistance, cl-environments, cl-fond, cl-i18n, cl-kraken, cl-kyoto-cabinet, cl-lzma, cl-naive-store, cl-opengl, cl-patterns, cl-prevalence, cl-pslib, cl-rdkafka, cl-redis, cl-sdl2, cl-smtp, cl-store, cl-str, cl-tiled, cl-toml, cl-torrents, closer-mop, clsql-local-time, clss, clx, clx-xembed, coleslaw, common-lisp-actors, conf, croatoan, curry-compose-reader-macros, dartsclhashtree, declt, deploy, drakma, dufy, eager-future2, eventbus, femlisp, fiasco, gendl, generic-cl, graph, horner, hunchentoot-multi-acceptor, jsown, lake, let-plus, lichat-protocol, linear-programming, literate-lisp, lyrics, magicl, maiden, markup, mcclim, metabang-bind, mmap, nodgui, num-utils, overlord, parachute, paren6, petalisp, phoe-toolbox, polisher, postmodern, py4cl, qbase64, qt-libs, qtools, quilc, quri, qvm, replic, roan, rutils, sc-extensions, scalpl, sel, serapeum, shadow, simple-parallel-tasks, simplet, skeleton-creator, sly, staple, static-vectors, studio-client, stumpwm, swap-bytes, sxql, the-cost-of-nothing, trivia, trivial-benchmark, trivial-jumptables, trivial-package-local-nicknames, trivial-ssh, type-r, umbra, vernacular, woo, wookie, xml-emitter, youtube.

To get this update, use: (ql:update-dist "quicklisp")

30 Nov 2019 9:09pm GMT