23 Jan 2020


Lispers.de: Berlin Lispers Meetup, Monday, 27th January 2020

We meet again on Monday 8pm, 27th January. Our host this time is James Anderson (www.dydra.com).

Berlin Lispers is about all flavors of Lisp including Common Lisp, Emacs Lisp and MacLisp.

Jörg Preisendörfer will talk about the Ion serialization format and Common Lisp. Ion is represented by either a human-readable text form or a compact binary form. The text form is a superset of JSON. [1]

We will also talk about the upcoming Lisp-related conferences in 2020.

[1] https://en.wikipedia.org/wiki/Ion_(serialization_format)

We meet in the Taut-Haus at Engeldamm 70 in Berlin-Mitte; the bell is "James Anderson". It is located within 10 minutes' walking distance of U Moritzplatz, U Kottbusser Tor, or Ostbahnhof. In case of questions call Christian +49 1578 70 51 61 4.

23 Jan 2020 12:00am GMT

22 Jan 2020


Joe Marshall: But what if you really want to push a stack frame?

If you really don't want tail recursion, the solution is simple: don't put the call in "tail position". We define a rather trivial function dont-tail-call and use it like this:

(dont-tail-call
  (foo x))

The semantics of Scheme are that the arguments are evaluated before the call to the function, so the call to foo is required to occur before the call to dont-tail-call which necessitates allocation of a continuation.

But what about a really clever compiler that somehow "optimizes" away the call to dont-tail-call? Such an "optimization" is highly questionable because it changes the observable semantics of the program so it is no longer call-by-value. A compiler that performed that transformation wouldn't be a Scheme compiler because Scheme is a call-by-value language.

But what if we had a really clever compiler and we disregard the change in semantics? We can easily thwart the compiler by deferring the definition of dont-tail-call until run time. Even the cleverest compiler cannot see into the future.

The definition of dont-tail-call is left to the reader, as is how to defer its definition until run time.

22 Jan 2020 3:28am GMT

21 Jan 2020


Joe Marshall: Afraid of Tail Recursion

It's a well-known fact among proponents of tail recursion that some people just don't get it. They view tail recursion at best as a quirky compiler optimization that turns some recursive calls into loops. At worst, they see it as some sort of voodoo, or a debugging pessimization. They see little value in it. Some have outright disdain for it.

Tail recursion isn't just about turning recursive calls into loops. It's about changing how you look at function calling. Tail recursion just happens to fall out of this new viewpoint.

Most programmers, I think, view function calls as if they were akin to a short vacation. You pack up the arguments in your luggage, travel to the destination, unpack your luggage, do some stuff, repack your luggage with some souvenirs, return home, unpack everything and resume life where you left off. Your souvenirs are the return value.

Should you need a vacation from your vacation, you do the same thing: pack up the arguments in your luggage, travel to your new destination, unpack your luggage, do some stuff, repack your luggage with some souvenirs, return to your original vacation spot and resume your original vacation.

Tail recursion aficionados realize that the journey itself is the important part of the function call, and that a vacation includes two journeys. On the first journey you pack up the arguments, including the return ticket, in your luggage, use the outbound ticket to journey to the destination, unpack your luggage, and start doing stuff. When you run out of stuff to do, you make the second journey. You fetch the return ticket, repack your luggage, take the ticket to wherever it leads (presumably back home), unpack everything, and resume doing whatever you were doing there.

But perhaps you want to visit grandma instead of going directly home. Then we change the script slightly. When you run out of things to do on your vacation, you pack up your luggage with your souvenirs and the return ticket, then you journey to grandma's house, where you unpack and start doing stuff. Eventually you are done visiting grandma, so then you fetch the return ticket, repack your luggage, take the ticket to wherever it leads, unpack everything, and resume doing stuff there. It's a three-legged journey. You don't go from grandma's back to the vacation resort - there's nothing left for you to do there. You take the return ticket directly home.

Viewing things this way, a function call involves packaging the arguments in a suitable way, deallocating any temporary storage, and then making an unconditional transfer to the function, where we unpack the arguments and resume execution of the program. It is simply "a goto that passes arguments".*

A function return is simply "a goto that passes a return value". It involves packaging the return value in a suitable way, deallocating any temporary storage, and then making an unconditional transfer to the return address, where we resume execution of the program.

A tail recursive function call is simply "a goto that passes arguments". It involves packaging the arguments in a suitable way, deallocating any temporary storage and then making an unconditional transfer to the function, where we resume execution of the program.

Do we really deallocate temporary storage before every control transfer? Certainly a return pops the topmost stack frame, and as often implemented, a tail recursive function call deallocates its stack frame or replaces it before transferring control, but a non-tail-recursive call? It does so as well, it's just that it also has to pack those values into a new continuation for the return trip. We use an implementation trick to avoid the absurdity of actually moving these values around: we move the base-of-frame pointer instead. Voila, we simultaneously deallocate the stack frame and allocate the continuation with the right values already in place.

Deallocating storage before each control transfer is an important part of the protocol. We're making an unconditional transfer to a destination with the hope, but no guarantee, that we'll come back, so we'd better tidy up before we leave. This ensures that we won't leave a trail of garbage behind us on each transfer which would accumulate and adversely affect the space complexity of our program.

Once you view a function call and return not as a single sequence, but as two separate and virtually identical sequences, then tail recursion becomes a natural consequence. Tail recursion isn't a special case of function call, it is the same thing as a function call, the only difference being whether a new continuation (the "return ticket") is allocated in order to come back. Even function returns are the same thing, the only difference being that the destination is (usually) determined dynamically rather than statically. Tail recursion isn't just another optimization, it's the result of treating inter-procedural control transfer symmetrically.

Another natural consequence is greatly increased options for control flow. Instead of a strict call/return pattern, you can make "three-legged" trips, or however many legs you need. You can make loops that incorporate one, two, or even a dynamically changing number of functions. You can replace compiler-generated returns with user-provided function calls (continuation-passing style) and implement arbitrarily complex control and data flow like multiple return values, exception handling, backtracking, coroutines, and patterns that don't even have names yet. And of course you can mix and match these patterns with the standard call and return pattern as well.

The phrase "tail recursion" is shorthand for this symmetric view of interprocedural control flow and is meant to encompass all these consequences and control flow options that spring from it. It's not about simply turning recursive functions into loops.

People who are afraid of tail recursion seem unable to see any value in taking up the symmetric viewpoint despite the fact that it opens up a whole new set of control flow techniques (in particular continuation-passing style). They find the notion that a procedure call is "a goto that passes arguments" "nonsensical". A lot of good research has come from taking this nonsense seriously.



*The view that a function call is simply a "a goto that passes arguments" was developed by Steele in his "Lambda papers".

The important point of cleaning up before the control transfer was formalized by Clinger in "Proper Tail Recursion and Space Efficiency".

Someone - it might have been Clinger, but I cannot find a reference - called tail recursion "garbage collection for the stack". The stack, being so much more limited in size than the heap, needs it that much more. Indeed Clinger notes the tight connection between tail recursion and heap garbage collection and points out that heap garbage collection is hampered if the stack is retaining pointers to logically dead data structures. If the dead structures are large enough, garbage collection can be rendered useless. Yet many popular languages provide garbage collection but not tail recursion.

The only difference between a call and return is that typically the call is to a statically known location and the return address is dynamically passed as a "hidden" argument. But some compilers, like Siskind's Stalin compiler, statically determine the return address as well.

The only difference between a function call and a tail recursive function call is when you need to return to the caller to complete some work. In this case, the caller needs to allocate a new continuation so that control is eventually returned. If there is no further work to be done in the caller, it doesn't create a new continuation, but simply passes along the one that was passed to it.
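A tiny continuation-passing-style sketch in Python makes that difference visible (purely illustrative: the function names are made up and Python does not actually eliminate tail calls). The tail call passes the caller's continuation along unchanged, while the non-tail call has to allocate a new one.

def add_cps(a, b, k):
    return k(a + b)          # "returning" means calling the continuation

def double_cps(x, k):
    # Tail call: no work is left here, so the continuation k is passed
    # along unchanged -- no new continuation is allocated.
    return add_cps(x, x, k)

def double_plus_one_cps(x, k):
    # Non-tail call: work remains (adding 1), so a new continuation is
    # allocated to do that work and then invoke k.
    return double_cps(x, lambda v: k(v + 1))

print(double_plus_one_cps(10, lambda v: v))   # prints 21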

Many compilers have been written that handle function calls, tail recursive function calls, and returns identically. They only change what code they emit for handling the continuation allocation. These compilers naturally produce tail recursive code.

Most machines provide special purpose support for a LIFO stack. It is tempting to use the stack for allocation of continuations because they are almost always allocated and deallocated in LIFO order, and a stack gives higher performance when this is the case. Many compilers do in fact use the stack for continuations and argument passing. Some, like Winkelmann's Chicken compiler, follow Baker's suggestion and treat the stack as a "nursery" for the heap. Others avoid using the stack altogether because of the extra complexity it entails. And others cannot use the stack because of constraints placed on stack usage by OS conventions or the underlying virtual machine model.

21 Jan 2020 10:58am GMT

Paul Khuong: Lazy Linear Knapsack

The continuous knapsack problem may be the simplest non-trivial linear programming problem:

\[\max_{x \in [0, 1]^n} p'x\] subject to \[w'x \leq b.\]

It has a linear objective, one constraint, and each decision variable is bounded to ensure the optimum exists. Note the key difference from the binary knapsack problem: decision variables are allowed to take any value between 0 and 1. In other words, we can, e.g., stick half of a profitable but large item in the knapsack. That's why this knapsack problem can be solved in linear time.
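For a concrete, made-up instance: take capacity \(b = 8\) and two items, \((w_1, p_1) = (6, 6)\) and \((w_2, p_2) = (10, 5).\) The first item has the better profit ratio, so we take all of it, then fill the remaining two units of capacity with \(x_2 = 2/10 = 0.2\) of the second item, for a total profit of \(6 + 0.2 \cdot 5 = 7;\) a binary knapsack would have to leave the second item out entirely.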

Dual to primal is reasonable

Duality also lets us determine the shape of all optimal solutions to this problem. For each item \(i\) with weight \(w_i\) and profit \(p_i\), let its profit ratio be \(r_i = p_i / w_i,\) and let \(\lambda^\star\) be the optimal dual (Lagrange or linear) multiplier associated with the capacity constraint \(w'x \leq b.\) If \(\lambda^\star = 0,\) we simply take all items with a positive profit ratio (\(r_i > 0\)) and a non-negative weight \(w_i \geq 0.\) Otherwise, every item with a profit ratio \(r_i > \lambda^\star\) will be at its weight upper bound (1 if \(w_i \geq 0\), 0 otherwise), and items with \(r_i < \lambda^\star\) will instead be at their weight lower bound (0 if \(w_i \geq 0\), and 1 otherwise).

Critical items, items with \(r_i = \lambda^\star,\) will take any value that results in \(w'x = b.\) Given \(\lambda^\star,\) we can derive the sum of weights for non-critical items; divide the remaining capacity for critical items by the total weight of critical items, and let that be the value for every critical item (with the appropriate sign for the weight).

For example, if we have capacity \(b = 10,\) and the sum of weights for non-critical items in the knapsack is \(8,\) we're left with another two units of capacity to distribute however we want among critical items (they all have the same profit ratio \(r_i = \lambda^\star,\) so it doesn't matter where that capacity goes). Say critical items with a positive weight have a collective weight of 4; we could then assign a value of \(2 / 4 = 0.5\) to the corresponding decision variable (and 0 for critical items with a non-positive weight).

We could instead have \(b = 10,\) and the sum of weights for non-critical items in the knapsack \(12\): we must find two units of capacity among critical items (they all cost \(r_i = \lambda^\star\) per unit, so it doesn't matter which). If critical items with a negative weight have a collective weight of \(-3,\) we could assign a value of \(-2 / -3 = 0.6\overline{6}\) to the corresponding decision variables, and 0 for critical items with a non-negative weight.

The last case highlights something important about the knapsack: in general, we can't assume that the weights or profits are positive. We could have an item with a non-positive weight and non-negative profit (that's always worth taking), an item with positive weight and negative profit (never interesting), or weights and profits of the same sign. The last case is the only one that calls for actual decision making. Classically, items with negative weight and profit are rewritten away, by assuming they're taken in the knapsack, and replacing them with a decision variable for the complementary decision of removing that item from the knapsack (i.e., removing the additional capacity in order to improve the profit). I'll try to treat them directly as much as possible, because that reduction can be a significant fraction of solve times in practice.

The characterisation of optimal solutions above makes it easy to directly handle elements with a negative weight: just find the optimal multiplier, compute the contribution of non-critical elements (with decision variables at a bound) to the left-hand side of the capacity constraint, separately sum the negative and positive weights for critical elements, then do a final pass to distribute the remaining capacity to critical elements (and 0-weight / 0-value elements if one wishes).
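Here's a minimal Python sketch of that construction, to make the recipe concrete. The function name, the (weight, profit) item representation, and the eps tolerance for detecting critical items are all assumptions of this sketch, not part of the original post.

def solution_given_multiplier(items, b, lam, eps=1e-12):
    """Build an optimal-form solution, given the optimal multiplier lam."""
    x = [0.0] * len(items)
    used = 0.0      # capacity consumed by non-critical (and weightless) items
    crit_pos = 0.0  # total positive weight of critical items
    crit_neg = 0.0  # total negative weight of critical items
    for i, (w, p) in enumerate(items):
        if abs(w) < eps:
            x[i] = 1.0 if p > 0 else 0.0   # weightless: take iff profitable
        else:
            r = p / w
            if r > lam + eps:              # profitable: weight upper bound
                x[i] = 1.0 if w > 0 else 0.0
            elif r < lam - eps:            # unprofitable: weight lower bound
                x[i] = 0.0 if w > 0 else 1.0
            else:                          # critical: decided in second pass
                if w > 0:
                    crit_pos += w
                else:
                    crit_neg += w
        used += w * x[i]
    residual = b - used                    # capacity left for critical items
    for i, (w, p) in enumerate(items):
        if abs(w) < eps or abs(p / w - lam) > eps:
            continue                       # only critical items remain
        if residual >= 0 and w > 0 and crit_pos > 0:
            x[i] = min(1.0, residual / crit_pos)
        elif residual < 0 and w < 0 and crit_neg < 0:
            x[i] = min(1.0, residual / crit_neg)
        # (a real implementation would also check feasibility here)
    return x

On the two examples above, the second pass assigns \(2 / 4 = 0.5\) and \(-2 / -3 = 0.6\overline{6}\) to the critical items, as expected.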

Solving the dual looks like selection

Finding the optimal multiplier \(\lambda^\star\) is similar to a selection problem: the value is either 0 (the capacity constraint is redundant), or one of the profit ratios \(r_i,\) and, given a multiplier value \(\lambda,\) we can determine if it's too high or too low in linear time. If the non-critical elements yield a left-hand side such that critical elements can't add enough capacity (i.e., no solution with the optimal form can be feasible), \(\lambda\) is too low. If the maximum weight of potentially optimal solutions is too low, \(\lambda\) is too high.

We can thus sort the items by profit ratio \(r_i\), compute the total weight corresponding to each ratio with a prefix sum (with a pre-pass to sum all negative weights), and perform a linear (or binary) search to find the critical profit ratio. Moreover, the status of non-critical items is monotonic as \(\lambda\) grows: if an item with positive weight is taken at \(\lambda_0\), it is also taken for every \(\lambda \leq \lambda_0\), and a negative-weight item that's taken at \(\lambda_0\) is also taken for every \(\lambda \geq \lambda_0.\) This means we can adapt selection algorithms like Quickselect to solve the continuous knapsack problem in linear time.
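Here is a small Python sketch of that sort-and-prefix-sum approach for the classic case where every weight and profit is positive; the critical_ratio name is made up, and the sign handling described above is left out for brevity.

def critical_ratio(items, b):
    """Return a candidate lambda* for items given as (weight, profit) pairs,
    all with weight > 0 and profit > 0, and capacity b."""
    order = sorted(items, key=lambda wp: wp[1] / wp[0], reverse=True)
    acc = 0.0
    for w, p in order:
        acc += w                 # running prefix sum of weights
        if acc >= b:
            return p / w         # this item is critical: lambda* is its ratio
    return 0.0                   # everything fits: the constraint is slack

On the two-item example above, critical_ratio([(6, 6), (10, 5)], 8) returns 0.5, the ratio of the item that ends up split; replacing the sort with Quickselect-style partitioning gives the linear-time version.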

I'm looking at large instances, so I would like to run these algorithms in parallel or even distributed on multiple machines, and ideally use GPUs or SIMD extensions. Unfortunately, selection doesn't parallelise very well: we can run a distributed quickselect where every processor partitions the data in its local RAM, but that still requires a logarithmic number of iterations.

Selection looks like quantile estimation; does the dual?

Lazy Select offers a completely different angle for the selection problem. Selecting the \(k\)th smallest element from a list of \(n\) elements is the same as finding the \(k / n\)th quantile1 in that list of \(n\) elements. We can use concentration bounds2 to estimate quantiles from a sample of, e.g., \(m = n^{3/4}\) elements: the population quantile value is very probably between the \(qm - \frac{\log m}{\sqrt{m}}\)th and \(qm + \frac{\log m}{\sqrt{m}}\)th values of the sample. Moreover, this range very probably includes at most \(\mathcal{O}(n^{3/4})\) elements3, so a second pass suffices to buffer all the elements around the quantile, and find the exact quantile. Even with a much smaller sample size \(m = \sqrt{n},\) we would only need four passes.

Unfortunately, we can't directly use that correspondence between selection and quantile estimation for the continuous knapsack.

I tried to apply a similar idea by sampling the knapsack elements equiprobably, and extrapolating from a solution to the sample. For every \(\lambda,\) we can derive a selection function \(f_\lambda (i) = I[r_i \geq \lambda]w_i\) (invert the condition if the weight is negative), and scale up \(\sum_i f_\lambda(i)\) from the sample to the population. As long as we sample independently of \(f\), we can reuse the same sample for all \(f_\lambda.\) The difficulty here is that, while the error for Lazy Select scales as a function of \(n,\) the equivalent bounds with variable weights are a function of \(n(|\max_i w_i| + |\min_i w_i|)^2.\) That doesn't seem necessarily practical; scaling with \(\sum_i |w_i|\) would be more reasonable.

Good news: we can hit that, thanks to linearity.

Let's assume weights are all integers. Any item with weight \(w_i\) is equivalent to \(w_i\) subitems with unit weight (or \(-w_i\) elements with negative unit weight), and the same profit ratio \(r_i\), i.e., profit \(p_i / |w_i|\). The range of subitem weights is now a constant.

We could sample uniformly from the subitems with a Bernoulli for each subitem, but that's clearly linear time in the sum of weights, rather than the number of elements. If we wish to sample roughly \(m\) elements from a total weight \(W = \sum_i |w_i|,\) we can instead determine how many subitems (units of weight) to skip before sampling with a Geometric of success probability \(m / W.\) This shows us how to lift the integrality constraint on weights: sample from an Exponential with the same parameter \(m / W\)!

That helps, but we could still end up spending much more than constant time on very heavy elements. The trick is to deterministically special-case these elements: stash any element with large weight \(w_i \geq W / m\) to the side, exactly once. By Markov's inequality,4 we know there aren't too many heavy elements: at most \(m.\)

Let's test this out

The heart of the estimation problem can be formalised as follows: given a list of elements \(i \in [n]\) with weight \(w_i \geq 0\), generate a sample of \(m \leq n\) elements ahead of time. After the sample has been generated, we want to accept an arbitrary predicate \(p \in \{0,1\}^n\) and estimate \(\sum_{i\in [n]} p(i) w_i.\)

We just sketched an algorithm for this problem. Let's see what it looks like in Python. The initial sample logic has to determine the total weight, and sample items with probability proportional to their weight. Items heavier than the cutoff are not considered in the sample and instead saved to an auxiliary list.

sample.py
import random


def total_weight(items):
    return sum(weight for (_, weight) in items)


def sample_by_weight(items, rate, cutoff):
    """Samples from a list of (index, weight), with weight >= 0.

    Items with weight >= cutoff are taken with probability one.
    Others are sampled with rate `rate` / unit of weight.
    """
    sample = []
    large = []
    next_sample = random.expovariate(rate)
    for item in items:
        index, weight = item
        if weight >= cutoff:
            large.append(item)
        else:
            next_sample -= weight
            while next_sample <= 0:
                sample.append(index)
                next_sample += random.expovariate(rate)
    return sample, large

We can assemble the resulting sample (and list of "large" elements) to compute a lower bound on the weight of items that satisfy any predicate that's independent of the sampling decisions. The value for large elements is trivial: we have a list of all large elements. We can subtract the weight of all large elements from the total item weight, and determine how much we have to extrapolate up.

extrapolate.py
import math


def hoeffding(n, alpha):
    """Determines how much we can expect a sample of n i.i.d. values
    sampled from a Bernoulli to differ, given an error rate of alpha.

    Given a sample X of n i.i.d. values from a Bernoulli distribution,
    let delta be \bar{X} - E[\bar{X}], the one-sided difference
    between the sample average value and the expected sample average.

    Hoeffding's upper bound (see below) is conservative when the
    empirical probability is close to 0 or 1 (trivially, it can yield
    confidence bounds that are outside [0, 1]!), but simple, and in
    general not much worse than tighter confidence intervals.

    P(delta >= eps) <= exp(-2 eps^2 n) = alpha
      -> -2 eps^2 n = ln alpha
     <->        eps = sqrt[-(ln alpha) / 2n ]

    """
    return math.sqrt(- math.log(alpha) / (2 * n))


def eval_weight(total_weight, sample, large, predicate, alpha):
    """Given a population's total weight, a memoryless sample (by weight)
    from the population's items, and large items that were
    deterministically picked, evaluates a lower bound for the sum of
    weights for items in the population that satisfy predicate.
    
    The lower bound is taken with error rate <= alpha.
    """
    large_sum = sum(weight for (index, weight) in large if predicate(index))
    # The remainder was up for sampling, unit of weight at a time.
    sampled_weight = total_weight - sum(weight for (_, weight) in large)
    if sampled_weight <= 0 or not sample:
        return large_sum
    # Empirical fraction of the sample points that satisfy the predicate.
    successes = sum(1 if predicate(x) else 0 for x in sample)
    empirical_rate = successes / len(sample)
    # Hoeffding gives a one-sided confidence bound on that fraction; we
    # subtract it (clamping at zero) before scaling back up to the total
    # weight that was up for sampling.
    delta = hoeffding(len(sample), alpha)
    return large_sum + sampled_weight * max(0, empirical_rate - delta)

And finally, here's how we can sample from an arbitrary list of items, compute a lower bound on the weight of items that satisfy a predicate, and compare that lower bound with the exact value.

lower_bound.py
def compare_bounds(items, rate, alpha, predicate):
    total = total_weight(items)
    # We expect a sample size of roughly rate * (total weight); items
    # heavier than the cutoff rate * total are taken deterministically.
    sample, large = sample_by_weight(items, rate, rate * total)
    lower_bound = eval_weight(total, sample, large, predicate, alpha)
    # Check if the lower bound is valid.
    actual = sum(weight for (index, weight) in items if predicate(index))
    return lower_bound <= actual + 1e-8, lower_bound, actual

How do we test that? Far too often, I see tests for randomised algorithms where the success rate is computed over randomly generated inputs. That's too weak! For example, this approach could lead us to accept that the identity function is a randomised sort function, with success probability \(\frac{1}{n!}.\)

The property we're looking for is that, for any input, the success rate (with the expectation over the pseudorandom sampling decisions) is as high as requested.

For a given input (list of items and predicate), we can use the Confidence sequence method (CSM) to confirm that the lower bound is valid at least \(1 - \alpha\) of the time.

csm_test.py
import sys

import csm  # the Confidence Sequence Method package referenced below


def compare_bounds_generator(test_case, rate, alpha):
    items = [(i, w) for (i, (w, _)) in enumerate(test_case)]
    chosen = set(i for (i, (_, p)) in enumerate(test_case) if p)
    while True:
        yield compare_bounds(items, rate, alpha, lambda x: x in chosen)[0]


def check_bounds(test_case, rate, alpha):
    """Test case is a list of pairs of weight and predicate value
       rate is the sample rate
       alpha is the confidence parameter for the lower bound.
    """
    wanted = 1 - alpha  # The Hoeffding bound is conservative, so
                        # this should let csm_driver stop quickly.
    result = csm.csm_driver(compare_bounds_generator(test_case,
                                                     rate,
                                                     alpha),
                            wanted,
                            1e-6,  # Wrong conclusion with p < 1e-6.
                            file=sys.stderr
                            )
    stop, actual, *_ = result
    assert actual >= wanted, "Result: %s" % str(result)

With a false positive rate of at most one in a million,5 we can run automated tests against check_bounds. I'll use Hypothesis to generate lists of pairs of weight and predicate value:

test_bounds.py
from hypothesis import given, settings, Verbosity
import hypothesis.strategies as st

@given(test_case=st.lists(st.tuples(st.floats(min_value=0, max_value=1),
                                    st.booleans())),
       rate=st.floats(min_value=1e-6, max_value=0.5),
       alpha=st.floats(min_value=0.05, max_value=0.25))
def test_bounds(test_case, rate, alpha):
    check_bounds(test_case, rate, alpha)

Bimodal inputs tend to be harder, so we can add a specialised test generator.

test_bimodal_bounds.py
@given(test_case=st.lists(st.tuples(st.one_of(st.just(0.1), st.just(1)),
                                    st.booleans())),
       rate=st.floats(min_value=1e-6, max_value=0.5),
       alpha=st.floats(min_value=0.05, max_value=0.25))
def test_bimodal_bounds(test_case, rate, alpha):
    check_bounds(test_case, rate, alpha)

Again, we use Hypothesis to generate inputs, and the Confidence sequence method (available in C, Common Lisp, and Python) to check that the lower bound is valid with probability at least \(1 - \alpha\). The CSM tests for this statistical property with power 1 and adjustable error rate (in our case, one in a million): we only provide a generator for success values, and the driver adaptively determines when it makes sense to make a call and stop generating more data, while accounting for multiple hypothesis testing.

TL;DR: the estimation algorithm for individual sampling passes works, and the combination of Hypothesis and Confidence Sequence Method lets us painlessly test for a statistical property.

We can iteratively use this sampling procedure to derive lower and (symmetrically) upper bounds for the optimal Lagrange multiplier \(\lambda^\star,\) and Hoeffding's inequality lets us control the probability that the lower and upper bounds are valid. Typically, we'd use a tolerance of \(\sqrt{\log(n) / n},\) for an error rate of \(1 / n^2.\) I prefer to simply use something like \(7 / \sqrt{n}:\) the error rate is then less than \(10^{-42},\) orders of magnitude smaller than the probability of hardware failure in any given nanosecond.6 We can still check for failure of our Las Vegas algorithm, but if something went wrong, it's much more likely that we detected a hardware failure than anything else. It's like running SuperPi to stress test a computer, except the work is useful. 😉
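To spell out the arithmetic behind those two rates: the Hoeffding bound from the hoeffding function above gives an error rate of \(\exp(-2 n \epsilon^2)\) for a tolerance of \(\epsilon,\) so \(\epsilon = \sqrt{\log(n) / n}\) yields \(\exp(-2 \log n) = 1 / n^2,\) while \(\epsilon = 7 / \sqrt{n}\) yields \(\exp(-98) \approx 3 \cdot 10^{-43},\) independently of \(n.\)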

Repeat as necessary to solve a knapsack

How many sampling passes do we need? Our bounds are in terms of the sum of item weight: if we let our sample size be in \(\Theta(\sqrt{n}),\) the sum of weights \(\sum_i |w_i|\) for unfathomed items (that may or may not be chosen depending on the exact optimal multiplier \(\lambda^\star\) in the current range) will very probably shrink by a factor of \(\Omega(n^{1/4}).\) The initial sum can, in the worst case, be exponentially larger than the bitlength of the input, so even a division by \(n^{1/4}\) isn't necessarily that great.

I intend to apply this Lazy Linear Knapsack algorithm on subproblems in a more interesting solver, and I know that the sum of weights is bounded by the size of the initial problem, so that's good enough for me! After a constant (\(\approx 4\)) number of passes, the difference in item weight between the lower and upper bound on \(\lambda^\star\) should also be at most 1. One or two additional passes will get me near optimality (e.g., within \(10^{-4}\)), and the lower bound on \(\lambda^\star\) should thus yield a super-optimal solution that's infeasible by at most \(10^{-4},\) which is, for my intended usage (again), good enough.

Given an optimal enough \(\lambda^\star,\) we can construct an explicit solution in one pass, plus a simple fixup for critical items. This Lazy Knapsack seems pretty reasonable for parallel or GPU computing: each sampling pass only needs to read the items (i.e., no partitioning-like shuffling) before writing a fraction of the data to a sample buffer, and we only need a constant number of passes (around 6 or 7) in the worst case.


  1. It's more like a fractional percentile, but you know what I mean: the value such that the distribution function at that point equals \(k / n\).

  2. Binomial bounds offer even stronger confidence intervals when the estimate is close to 0 or 1 (where Hoeffding's bound would yield a confidence interval that juts outside \([0, 1]\)), but don't impact worst-case performance.

  3. Thanks to Hoeffding's inequality, again.

  4. That's a troll. I think any self-respecting computer person would rather see it as a sort of pigeonhole argument.

  5. We're juggling a handful of error rates here. We're checking whether the success rate for the Lazy Knapsack sampling subroutine is at least as high as \(1 - \alpha,\) as requested in the test parameters, and we're doing so with another randomised procedure that will give an incorrect conclusion at most once every one million invocations.

  6. This classic Google study found 8% of DIMMs hit at least one error per year; that's more than one single-bit error every \(10^9\) DIMM-second, and they're mostly hard errors. More recently, Facebook reported that uncorrectable errors affect 0.03% of servers each month; that's more than one uncorrectable error every \(10^{10}\) server-second. If we performed one statistical test every nanosecond, the probability of memory failure alone would still dominate statistical errors by \(10^{20}!\)

21 Jan 2020 4:10am GMT

19 Jan 2020


Joe Marshall: Afraid of Recursion

Here's a trick I learned several years ago for transferring time-series data. In the case in question, I needed to transfer a bunch of timestamped records, but the server had a quirk to it. If you asked for too many records at once, it would simply return an error code and give up on your request. There was no way to know beforehand how many records might exist in any given time span, so you could get an error code on nearly any request, unless it was for a very short time span. On the other hand, many requests for long time spans would succeed because they had few records in them. Despite this quirk, the code was really simple:

List<record> xfer (Timestamp start, Timestamp end) {
    try {
        return tryFetchRecords(start, end);
    } catch (TooManyRecordsException e) {
        Timestamp mid = (start + end)/2;
        List<record> firstHalf = xfer (start, mid);
        List<record> secondHalf = xfer (mid, end);
        firstHalf.addAll(secondHalf);  // addAll returns a boolean, not the list
        return firstHalf;
    }
}

On any request, if the server returned the error code, we would simply bisect the time span, recursively ask for each half separately, and combine the two halves. Should the bisected time span still contain too many records, the time span would be bisected again. The recursion would continue until the time span was small enough that the server could honor the request and return some records. The recursion would then unwind, combining the returned records into larger and larger lists until we had all the records for our original time span. Since the time span would be cut in half on each recurrence, the depth of recursion would be proportional to the logarithm (base 2) of the total number of records, which would be a reasonably small number even with an enormous number of records.

It's certainly possible to avoid recursion and do this iteratively by some form of paging, but the code would be slightly more complex. The server is not very cooperative, so there is no easy way to determine an appropriate page size beforehand, and the server doesn't support a "paging token" to help keep track of progress. The recursive solution finds an appropriate transfer size by trial and error, and keeps track of progress more or less automatically. An iterative paging solution would have to do these things more explicitly and this would make the iterative code a bit more complex. And why add any complexity when it isn't really necessary?

I thought this solution was really cool when I first saw it. I've used this trick for transferring time series data many times. It makes the server very simple to write because the protocol requires so little of it. It simply has to refuse to answer requests that return too many results. The client code is just about the 10 lines above.

But when I first suggest this code to people I usually get "push back" (that is, until they see it work in action, then they usually get on board with it). People seem unsure about the use of recursion and want a more complex protocol where the client and server negotiate a page size or cooperatively pass a paging token back and forth on each request and response. Their eyes glaze over as soon as they see the recursive call. They want to avoid recursion just because it's recursive.

I've seen "aversion to recursion" happen in a lot of circumstances, not just this one. Recursion isn't the solution to everything. No tool solves all problems. But it is an important tool that often offers elegant solutions to many problems. Programmers shouldn't be afraid of using it when it is appropriate.

19 Jan 2020 11:52pm GMT

18 Jan 2020


Joe Marshall: Unsyndicated blog

I've noticed that my blog posts are replicated in Planet Lisp and Planet Scheme, and here I am spamming them with random math stuff. So I'm creating a new blog, Jrm's Random Blog, where I can feel free to post about math, science, computers in general, and whatever else bugs me, without spamming the Lisp and Scheme readers. I'll keep posting to Abstract Heresies, but try to keep it more Lisp and computer language focused.

18 Jan 2020 1:33pm GMT

16 Jan 2020


Joe Marshall: Groups, semigroups, monoids, and computers

The day after I rant about mathematicians, I make a math post. "Do I contradict myself? Very well, then, I contradict myself, I am large, I contain multitudes." - Walt Whitman

A group is a mathematical concept. It's pretty simple. It consists of a set, G, and an operation, *, which can be used to combine any two elements of G. What the set contains is not that important. It is the * operation we're interested in, and we can usually swap out G for another set without causing too many problems other than having to change the type signature of *. There are four axioms that * must obey: closure (combining any two elements of G with * yields another element of G), associativity ((a * b) * c is the same as a * (b * c)), identity (there is an element e such that e * a = a * e = a for every a in G), and inverse elements (for every a there is an a-1 such that a * a-1 = a-1 * a = e).

Frequently you run into something that obeys all the axioms but the inverse element axiom. This is called a monoid. A monoid is very much like a group except that you can get "stuck" when manipulating it if you run into one of the non-invertible elements because there's no inverse to "undo" it. There are certain things about monoids that are true only "if the appropriate inverses exist". You run into that qualifier a lot when dealing with monoids. You don't need that qualifier if you are dealing with a group because they do exist by axiom. Or we could say that calling something a group is simply shorthand for adding "if the appropriate inverses exist" everywhere.

What does this have to do with computers? Consider the set of all subroutines with the operation of concatenation. It is closed - concatenating two subroutines gives you a third subroutine. It is associative - you just concatenate them linearly. There is an identity element, usually called no-op. And many, but not all, subroutines have inverses. So we have a monoid.

Consider the set of all strings with the operation of concatenation. It is closed, associative, the empty string is the identity element. It is a monoid.

Consider the set of functions whose input type is the same as the result type with the operation of composition. It is closed, associative, the identity function is the identity element. It is a monoid. If we consider only the subset of functions that also have inverses, we have a group. This particular monoid or group comes in especially handy because composition of functions is so useful.

Consider the set of invertible 2x2 matrices with integer components, a determinant of 1 or -1, and the operation of matrix multiply. It is closed, associative, there is an identity matrix, and I already said just consider the invertible ones. It forms a group. This group comes in handy for implementing arbitrary precision arithmetic. (Thanks to Bradley Lucier for the correction of the condition on the determinant. This makes the matrix continue to have integer components upon inversion, keeping things closed.)

The permutations of a list form a group. The integers under addition form a group.

These things are everywhere. And it isn't a coincidence. The concepts of a group, monoid, and semigroup are meant to capture the essence of what it is to have a foldable sequence of elements. (Can I complain about mathematicians here? They make up so much terminology and abstraction that it is virtually impossible to get at what they really mean. We're just talking about sequences of elements and trying to find some minimal axioms that you need to have to fold them, but trying to find literature that actually says that's what we're doing is like trying to pull hen's teeth.)

So what good are groups, monoids, and semigroups? Aside from the obvious fact that foldable sequences are ubiquitous and really useful, that is. Not immediately apparent from the axioms is that in addition to folding a sequence, you can transform a sequence into a different, but equivalent one. If the appropriate inverses exist (there's that phrase), you can "unfold" some or all elements of a sequence. So by judicious folding and unfolding, you can transform a sequence.
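As a quick concrete illustration (a Python sketch with made-up helper names; the Scheme examples below make the same point with fold-left and compose), folding a sequence only needs a closed, associative operation and its identity element:

from functools import reduce

def fold(op, identity, xs):
    return reduce(op, xs, identity)

print(fold(lambda a, b: a + b, "", ["foo", "bar", "baz"]))  # string monoid
print(fold(lambda a, b: a + b, 0, [1, 2, 3, 4]))            # integers under +

# Functions from a type to itself form a monoid under composition.
compose2 = lambda f, g: (lambda x: f(g(x)))
double_after_inc = fold(compose2, lambda x: x, [lambda x: 2 * x, lambda x: x + 1])
print(double_after_inc(10))   # 22: the rightmost function runs first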

Here's an unusual abstract example. Consider a pipeline which has a set of nodes and communicates values of the same type between the nodes. Values accumulate at the nodes until they are transmitted to the next node in the pipeline. We start with all the values in the initial node (on the right) and transmit them to the left:

(pipeline (node) (node) (node a b c))  ;; transmit the a
(pipeline (node) (node a) (node b c))  ;; transmit the b
(pipeline (node) (node a b) (node c))  ;; transmit the a
(pipeline (node a) (node b) (node c))  ;; transmit the c
(pipeline (node a) (node b c) (node))  ;; transmit the b
(pipeline (node a b) (node c) (node))  ;; transmit the c
(pipeline (node a b c) (node) (node))  ;; done

If the values we transmit are drawn from a group, we can replace each node with the group's * operator:

(* identity identity (* a b c))  ;; transmit the a
(* identity (* identity a) (* b c))  ;; transmit the b
(* identity (* a b) (* identity c))  ;; transmit the a
(* (* identity a) (* identity  b) (* identity c))  ;; transmit the c
(* (* identity a) (* b c) identity)  ;; transmit the b
(* (* a b) (* identity c) identity)  ;; transmit the c
(* (* a b c) identity identity)  ;; done

The astute reader will notice that all we're doing is making use of the associativity axiom and moving the parenthesis around so that the values seem to move between the different nodes. But we preserve the invariant that the "value" of the entire pipeline doesn't change as the values move. The * operator need not be concatenate, which would give simple queuing behavior, but can be any operator satisfying the axioms giving us much more interesting pipelines. One implementation of arbitrary precision arithmetic transmits Möbius transformations along just such a pipeline to refine the upper and lower limits of a computed approximation. In this implementation, the * operator is the composition of Möbius transformations.

Here's a more concrete example. If you have a series of nested functions: (f (g x)) and both f and g take and return the same type, rewrite it as ((compose f g) x) and use a little group theory on it.

(f (g x))
((compose f g) x)
;; or more explicitly
((fold-left compose identity (list f g)) x)

If the appropriate inverses exist, then there will be another function h such that (compose f g) is equal to (compose h f) essentially allowing you to "slide" g to the left "through" f. It is relatively easy to see that h must be equivalent to (compose f g f-1). Mathematicians say that h is conjugate to g. Conjugates always have a form like aba-1. By finding conjugates, you can take a sequence and slide the elements left and right through other elements. This also allows you to fold things out of order. (Or in the pipeline example, transmit items out of order.) If we were left folding into an accumulator, folding h before f is equivalent to folding g after f. Another way of looking at it is this. Suppose we're standing to the left of f and looking through the "lens" of f at g. h is what g "looks like" when viewed through f.

If we want, we can define slide such that (compose slide (compose f g)) is equivalent to (compose h f). slide is (compose h f g-1 f-1). (This isn't a generic slide sequence, it only works on (compose f g). It ought to be an identity because (compose f g) is equivalent to (compose h f).) I complained that mathematicians provided too few concrete examples, so here is a concrete example using list permutations:

> (reverse (rotate-left '(a b c d)))
(a d c b)

;; rewrite as explicit fold-left of compose
> ((fold-left compose identity (list reverse rotate-left)) '(a b c d))
(a d c b)

;; sliding rotate-left through reverse turns it into rotate-right
> ((fold-left compose identity (list rotate-right reverse)) '(a b c d))
(a d c b)

;; A sequence that when composed with (list reverse rotate-left) turns it into
;; (rotate-right reverse)
> (define slide 
    (fold-left compose identity (list rotate-right reverse rotate-right reverse)))
slide

> ((fold-left compose identity (list slide reverse rotate-left)) '(a b c d))
(a d c b)

;; rewrite back to direct procedure calls
> (rotate-right (reverse '(a b c d)))
(a d c b)

;; and slide ought to be an identity
> ((fold-left compose identity (list slide)) '(a b c d))
(a b c d)


Or suppose you have (f (g x)), but for some reason you want (g (f x)) (which would, in general, be a different value unless f and g happen to commute). Again, rewrite (f (g x)) as ((compose f g) x) and apply a little group theory. If the appropriate inverses exist, there will be a function commute-fg such that (compose commute-fg (compose f g)) is equivalent to (compose g f). With a little thought, you can see that commute-fg is equivalent to (compose g f g-1 f-1). (Again, this isn't a generic commute, it only causes this specific f and g to commute.) commute-fg is called a commutator because it makes f and g commute. Commutators always have the form aba-1b-1. By finding commutators and inserting them in the right place, you can take a sequence and swap adjacent elements. Again, a concrete example with lists:

;; an illustration of what swap-first-two does
> (swap-first-two '(a b c d))
(b a c d)

;; we're given
> (reverse (swap-first-two '(a b c d)))
(d c a b)

;; but we want, for some reason to reverse first
> (swap-first-two (reverse '(a b c d)))
(c d b a)

;; rewrite as fold-left of compose
> ((fold-left compose identity (list reverse swap-first-two)) '(a b c d))
(d c a b)

;; define our commutator
;; note that swap-first-two and reverse are their own inverses
> (define commute-fg
    (fold-left compose identity (list swap-first-two reverse swap-first-two reverse)))

;; make f and g commute
;; observe that it returns the desired result
> ((fold-left compose identity (list commute-fg reverse swap-first-two)) '(a b c d))
(c d b a)


There's two interesting things here. First, notice that in both examples I convert (f (g x)) to ((fold-left compose identity (list f g)) x) and then proceed to ignore x and just consider (fold-left compose identity (list f g)) as if x didn't exist. I've abstracted away the x. (Of course I have to eventually supply the x if I want an answer, but it only comes back at the last moment.) Second, notice that although slide and commute-fg are foldable sequences, I use them as if they were higher order functions operating on the foldable sequence (compose f g) to transform it, first into (compose h f), second into (compose g f). This second thing is a neat trick. We're taking a function that operates on lists and treating it as if it were a higher-order function that operates on functions. This is called the "action" of slide and commute-fg because it appears as if elements of the set G of our group can "act" directly on other elements.

Every element in the underlying set G of a group has an action associated with it which operates directly on other elements in G. This is an important concept in group theory. Now earlier I said that the actual elements of G don't matter much, so the action must be more closely tied to the operator *. And if we swap out G for another set we'll still have the same actions, they'll just be associated with the elements of the new set (in an isomorphic way). The actions are pretty abstract.

There's a lot more one could say about the actions. They are a rich source of interesting math. My brain is getting fatigued with all this abstraction, so I'll leave the topic be for now.

If group theory is about the essence of what it means to have a foldable sequence, then category theory is about the essence of composition. They offer two somewhat different approaches to similar material. What do you do with sequences but compose them? What comes from composition but a sequence? Many concepts in group theory carry over into category theory. Naturally a completely different set of terminology is used, but the concepts are there.

But that's enough group theory for today and category theory can wait until later posts.

16 Jan 2020 1:48pm GMT

15 Jan 2020


Joe Marshall: Math is hard, let's go shopping

I find mathematics, with all its weird terminology and abstraction and equations, hard to understand. That's kind of funny coming from someone like me who makes a living from a branch of mathematics. I find computers and programming to be rather easy to understand - probably because I've had a lot of practice. But computer science is just applied logic and programming is arguably just the study of the computable functions, so you'd think math would come naturally. It doesn't.

One problem I've found is that as much as mathematicians pride themselves on rigor, they tend to be a bit sloppy and leave out important details. Computer scientists don't leave out important details because then the programs won't run. It's true that too much detail can clutter things up, but leaving out the detail and relying on "context" just increases the intellectual burden on the reader.

I will give mathematicians credit for thinking about edge cases perhaps more than a computer scientist would. It can be easy to be a bit complacent with edge cases because the computer will likely do something even if you don't think too hard about what it ought to do. But a good computer scientist tries to reduce the number of edge cases or at least make them coherent with the non-edge cases.*

Mathematicians seem to take perverse pleasure in being obscure. Computer scientists strive to be as obvious as possible because like as not, they are the ones that have to revisit the code they wrote and don't want to have to remember what they were thinking at the time. It's just easier to spell things out explicitly and obviously so that you can get back up to speed quickly when you have to debug your own stupid code. Every time I pick up some literature on category theory, I get hit with a "Wall of Terminology" denser than the "Wall of Sound" on a Phil Spector recording. It's fundamentally simple stuff, but it is dressed up in pants so fancy one has a hard time extracting the plain meaning. What seems to be universal in category theory is my difficulty in getting past page 4.

I once read a mathematical paper that talked about an algorithm with three tuning parameters: α, β, and another α. No decent computer programmer would give the same name to two different variables. Which α was which was supposed to be "obvious" from the context. The brainpower needed to keep track of the different αs was absurd and a complete waste of effort when calling the variable something else, like γ would have done the trick.

And don't ask a mathematician to write computer code. That's the one time they'll leave out all the abstraction. Instead of a nice piece of abstract, functional code, you'll get a mess of imperative code that smashes and bashes its way to a solution with no explanation of how it got there. It's a lot easier to take some abstract, functional code and figure out a more optimal, probably imperative, way to do it than it is to take a more optimal imperative piece of code and figure out the abstract, functional meaning of it.

I've found it to be extremely helpful when a computer paper includes one or two concrete examples of what it is talking about. That way, if I try to go implement code that does what the paper suggests, there's some indication that I'm on the right track. I'm more confident that I understand the paper if I have working code that produces the exact same values the paper's authors got. It's harder to find concrete examples in a math paper, and it is easier to think you know what it says but be far off base if there aren't any examples.

Maybe I shouldn't blame mathematicians so much and look a little closer to home. Perhaps I should study harder instead of demanding to be spoon fed difficult concepts. But then I read Feynman, S&ICP, S&ICM, and Jaynes and discover that maybe I just need a simple explanation that makes sense to me.

Sturgeon's Revelation is "90% of everything is crap". This is true of both mathematical papers and computer science papers.



*An old joke illustrates the importance of thinking of edge cases: A programmer implements a bar. The test engineer goes in and orders a beer, orders zero beers, orders 999999999 beers, orders -1 beers, orders a lizard, and declares the bar ready for release. The first customer comes in and asks to use the restroom. The bar catches fire and burns down.

15 Jan 2020 2:55pm GMT

Charles Zhang: SBCL20 in Vienna

Last month, I attended the SBCL20 workshop in Vienna. Many thanks to the organizers and sponsors for inviting me to give a talk about my RISC-V porting work to SBCL and allowing me to otherwise throw around some ideas in the air with a bunch of SBCLites.

This was my first Lisp conference. It was really nice being able to meet a lot of people who up until then had only been floating voices on the internet. Given the location, it's no surprise that most of the attendees were European, but what did surprise me was the actual turnout for the event, some even having attended SBCL10. I, like seemingly many others, was certainly not expecting around 25 to attend. (Robert Smith had given a paltry estimate of about 5!)

On Sunday we had a nice tour around some cool places around Vienna by our gracious host, Phillip Marek. I got to the group right as they were in Donauturm, and had lunch afterwards. We then moved to Karlsplatz where Phillip hunted for daguerreotypes. Fortune looked down upon us that day, since it was a nice sunny 10°C in Vienna in the middle of December!

Then on Monday, we proceeded to start the workshop proper, at about 8:30 am. We were hosted by the Bundesrechnenzentrum (the Austrian Federal Computing Center), and accordingly, after Christophe kicked off the workshop, had some BRZ representatives talk about how the work they do combats things like tax fraud in Austria. We had a nice room with a lot of space, mixer-style tables and snacks in the back of the room. At first, the schedule was that Douglas Katzman was to go Monday morning, and Robert Smith, in the afternoon, with my talk scheduled for Tuesday morning. I ended up asking Robert if he would switch with me as I was pretty anxious to get my talk over with that day... And thus we pressed forward into our first talk of the day, maybe at around 10:30 am.

SBCL & Unix

Doug Katzman talked about his work at Google getting SBCL to work with Unix better. For those of you who don't know, he's done a lot of work on SBCL over the past couple of years, not only adding a lot of new features to the GC and making it play better with applications which have alien parts to them, but also has done a tremendous amount of cleanup on the internals and has helped SBCL become even more Sanely Bootstrappable. That's a topic for another time, and I hope Doug or Christophe will have the time to write up about the recent improvements to the process, since it really is quite interesting.

Anyway, what Doug talked about was his work on making SBCL more amenable to external debugging tools, such as gdb and external profilers. It seems like they interface with aliens a lot from Lisp at Google, so it's nice to have backtraces from alien tools understand Lisp. It turns out a lot of prerequisite work was needed to make SBCL play nice like this, including implementing a non-moving GC runtime, so that Lisp objects and especially Lisp code (which are normally dynamic space objects and move around just like everything else) can't evade the aliens and will always have known locations.

Now it's time for questions, and hacking around until the next talk! (oh, wait a second...) Christophe had encouraged us all to 'go forth and produce something' in the meantime, but I needed to add a few more examples to my slides and eat something before I gave my talk. We had some cold sandwiches of various types for the day, and people started working on various projects.

RISC-V porting talk, VOPs

Around 1:10 pm or so, I went up to the podium to get my laptop set up for the talk. The HDMI cable when plugged into my laptop directly didn't work, but curiously fitting the HDMI cable through a USB3 converter and connecting that to my laptop made the projector work. Anyway, I got a laser pointer, which, now that I think back on it, probably waved around way too much and was probably fairly distracting. The slides are now posted on the website if you're curious what I talked about. I ended up following it pretty closely and was unsure how much detail to get into because I wasn't sure of the audience's familiarity with the SBCL internals, which porting a new backend is usually going to get pretty deep into.

There was general knowledge of the internal VOP facility in SBCL though, which is usually defined for a given backend to translate the generic machine independent low level intermediate representation (IR2 in internals parlance, VMR (virtual machine representation) in "public" internals documentation parlance) to target-specific machine code. Lots of people want to write their own inline assembly and integrate them with high level Lisp functions (with register allocation done for them), usually so they can use some hardware feature SBCL doesn't expose at a high level directly. For example, SIMD instructions are popular for number crunching people to want to use with SBCL. Well, VOPs are a nice way to do this, so that's why so many people knew about them. Except for VOP lifetimes. Lots of people were also confused about VOP lifetimes. The only reason I ended up understanding VOP lifetimes was because I had debugged too many backend issues where the register allocator destroyed a register I needed the value of. In fact, the first patches I got (from Phillip Mathias Schäfer) for SB-ROTATE-BYTE support on RISC-V had lifetime issues, which I fixed before merging. And, incidentally, right after my talk, Doug showed me a tool written by Alastair Bridgewater called voplife.el that visualizes VOP lifetimes and told me that he never writes VOPs without that tool. Well, that would've been nice to have! And then Christophe told me that of course the tool didn't exist when he was doing backend work.

Speaking of 'back in my day', in my slides I gave out (to use an Irish expression) about how long bootstrapping took with the emulator. Christophe proceeded to tell me about his experience porting to HPPA machines in the early 2000s, where it took about a full day just waiting for the system to bootstrap... It's easy to forget that Moore's law happens (happened?) sometimes.

Oh, and just so I remember for the future, I got some questions from Tobias Rittweiler about how I handled memory model issues. I basically said I didn't, because I was porting a new CPU, not an OS, since the Linux support routines handle almost all of those concerns. Then Doug asked me why Load-Immediate-64 on RISC-V was so complicated: couldn't I have just loaded a word from memory? To which I responded that it's not clear whether it's more expensive to load a word from memory versus materializing it with only register operations. This is a problem they solved in the RISC-V GCC backend, and last time I checked, the RISC-V backend for LLVM just punts and does the basic, unoptimized sequence. Then he asked me why I started with Cheney GC, which I deflected straight away to Christophe, who made the initial decision. He basically said, "it's easy to fit Cheney GC entirely in my head at once." Fair.
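
For context on why immediate materialization is fiddly at all: RISC-V has no single instruction that loads a full-width constant, so a constant gets built out of LUI (which sets the upper 20 bits) and ADDI (which adds a sign-extended 12-bit immediate), with 64-bit values chaining several such steps together with shifts. Here's a toy illustration, not SBCL code, of just the 32-bit split and its sign-extension wrinkle:

(defun lui/addi-parts (value)
  "Split a 32-bit VALUE into the LUI (upper 20 bits) and ADDI (low 12
bits, sign-extended) parts that reconstruct it on RISC-V."
  (let* ((low (ldb (byte 12 0) value))
         ;; ADDI sign-extends its immediate, so when bit 11 is set the
         ;; low part goes negative and the upper part must compensate.
         (signed-low (if (logbitp 11 low) (- low 4096) low))
         (high (ldb (byte 20 12) (- value signed-low))))
    (values high signed-low)))

Whether emitting that kind of instruction sequence beats a single load from a nearby constant pool depends on the microarchitecture, which is why the answer isn't obvious.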

Monday lightning talks

After the talk we had some more time to work on stuff, like demos for the lightning talks. I didn't really do much besides talking to people about our internals though. Rui from 3e asked me about supporting tracing through local functions. One of my main interests is actually optimizing away local functions, so, I'm probably not the one who's going to implement it (unless you paid me to), but it seems fairly straightforward to port the support which was added to CMUCL after the fork.

Then we had our Monday afternoon lightning talks. I remember this just being a 'if you have something to say or demo come up and do it' kind of thing. Marco Heisig went up first to talk about SB-SIMD. I had helped him debug getting the VOPs installed into SBCL properly a little bit before his demo, and he showed us some cool support he's adding. He ended up sending a follow up email after the conference with a more formal proposal to integrate it into SBCL. I hope he has the time to move it forward and have it in-tree in some form or another.

Then james anderson, in what was a very memorable moment for me, went up for his 'demo', which ended up being a quite punctuated proclamation: 'We need concurrent GC!' Indeed, we do.

I'm already starting to forget the details of the remaining talks on Monday. Folks who were there, help me remember!

Dinner

We had an official SBCL20 dinner afterwards, and it was time for some Austrian food. I sat in front of Marco and next to Luís Oliveira and enjoyed some Viennese schnitzel. I asked for tap water and got something that looked like but was definitely not tap water...

SBCL & quantum computing

Tuesday morning was a similar drill. We had a new (smaller) room, and this time, we needed our passports for access. Tuesday was lightning talk day, but first, Robert Smith gave a talk about how they use SBCL for quantum computing at Rigetti. They have a state-of-the-art quantum compiler and a quantum simulator, but Robert first gave us a quick primer on some physics and math (including tensor products in a concrete way, which was a nice breath of fresh air after what seemed like endless classes characterizing them by their universal property). His slides are online, check them out! He's also interested in making SBCL play nice with aliens, but in a different way than Doug is. For one thing, he'd like an ECL-like API for SBCL, so that their quantum compiler code compiled with SBCL could be exposed as a traditional C API. What really struck me in his talk was their compiler's ability to propagate qubit fidelity, letting it sometimes 'miscompile' to get a more 'accurate' answer. (Scare quotes because quantum is spooky.)

Also, they rewrote one of their Lisp applications in another language due to outside pressure, but the rewrite was slower. It's also cool to know that, according to Robert, most of his team did not have much Lisp exposure before joining, which gives a sense that the community is still kicking.

Tuesday lightning talks

We proceeded to hack on more stuff after the talk and had a hot meal for lunch this time. I actually started working on something this time. A conversation with Doug the previous day had me saying that we do loop invariant code motion and stuff, to which Doug said, "but we don't." So, I looked and he was right, although I was genuinely surprised because it is a transformation our compiler framework easily supports. We do do a lot of traditional optimizations, in addition to some state of the art dynamic language type inference (stuff all written in the late 80's!) since our intermediate representations are well suited for that sort of thing. In fact, SBCL's front-end intermediate representation is essentially CPS in a flow graph, which anticipates a lot of research in the area done in the 90's and 2000's, basically being locally equivalent to SSA and not falling into the trap of being bound to scope trees.
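
For the curious, here's the flavor of thing loop invariant code motion addresses; COUNT-SHORTER is just a made-up example, not code from SBCL. The (LENGTH YS) call doesn't depend on anything the loop changes, so a compiler doing the transformation could compute it once before the loop instead of on every iteration:

(defun count-shorter (xs ys)
  "Count the elements of XS that are smaller than the length of YS."
  (let ((count 0))
    (dolist (x xs count)
      ;; (LENGTH YS) is loop-invariant: nothing in the loop body changes
      ;; YS, so this computation could be hoisted above the DOLIST.
      (when (< x (length ys))
        (incf count)))))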

So I started working on loop invariant code motion, and while I didn't quite finish by the end of the conference, I did get a proof of concept afterwards that almost self-builds and works alright. Though after discovering some ancient notes by the original implementer (Rob MacLachlan) on the issue, I've decided I took the wrong approach after all. (The issue is that I worked on the level of IR1 instead of IR2.) Oh well.

Meanwhile, we had lightning talks starting around 1:00 pm with a short break at around 2:45 pm, if I recall correctly. The full list of topics that day is on the website, in order. We descended into a bit of a wishlist game, with Phillip talking about where to move hosting for SBCL. (The options were: stay with SourceForge, move to GitHub, move to GitLab, or move to common-lisp.net hosted GitLab. It was honestly quite the controversy.) Then I briefly talked about the loop invariant code motion work I was doing, and then asked the audience who had heard of Block Compilation. I don't remember the exact number, but I think there were more who didn't know than who knew. After complaining a little about how SBCL doesn't have it even though CMUCL does, I made it one of my big wishlist items, since I think the ability to do whole-program optimization is pretty important for a high-performance compiler, especially for a dynamic language like Lisp where most of the dynamic facilities go unused once an application is up and running in production (usually). Well, I ended up (re)implementing it yesterday, so maybe people will learn about it again. I might write up about it sooner or later. Then Stelian Ionescu talked about his wishlist items (such as gutting a lot of the 'useless' backends) and we opened it up to the floor.
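
For those who haven't seen block compilation before, here's a sketch of the CMUCL-style interface (the revived SBCL support follows the same general shape, though details may differ). Within a block, calls between the enclosed functions can be compiled as fast local calls instead of going through the usual late-bound, redefinable calling convention, and only the declared entry points stay visible as ordinary global functions. EVEN-STEP and ODD-STEP are made-up examples.

;; Everything between START-BLOCK and END-BLOCK is treated as one unit;
;; EVEN-STEP and ODD-STEP are the entry points callable from outside.
(declaim (start-block even-step odd-step))

(defun even-step (n)
  (if (zerop n) t (odd-step (1- n))))

(defun odd-step (n)
  (if (zerop n) nil (even-step (1- n))))

(declaim (end-block))

;; Whole files can also be block-compiled in one go, e.g.:
;;   (compile-file "numerics.lisp" :block-compile t)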

Wrap-up

After the official end of the conference, most of the crew went across the street into a mall to eat and chat for the rest of the night. Doug ended up showing me some cross-disassembler stuff after some prompting about its removal, while, next to us, Luís did a great job getting relocatable heaps working on Windows, which he promptly got upstreamed after the workshop. Great to see that new projects were motivated and finished as a result of SBCL20. It was a fun time, and, as Zach Beane said, I'm hoping we organize and meet again soon!


15 Jan 2020 2:51am GMT

14 Jan 2020

feedPlanet Lisp

Joe Marshall: Palindromes, redux, and the Sufficiently Smart Compiler

The Sufficiently Smart Compiler is mentioned by authors as shorthand for "a compiler that performs nearly all reasonable optimizations, but in particular this one I want". Many attempts were made up through the 80's and maybe into the 90's to write a Sufficiently Smart Compiler that would perform all "reasonable" optimizations, and although many impressive results have been obtained, there always seem to be fairly obvious optimizations that remain unoptimized. These days it seems that people realize that there will be good compilers and some very good compilers, but never a Sufficiently Smart Compiler. Nonetheless, it is worth considering a Sufficiently Smart Compiler as a tool for thought experiments.

I was curious what would be necessary for a Sufficiently Smart Compiler to generate optimal code for the palindrome problem given the naive algorithm.

The naive algorithm is inspired by the axioms that a string of fewer than two characters is a palindrome, and that a longer string is a palindrome if its first and last characters are equal and the substring between them is a palindrome. That gives us this:

(define (palindrome1? string)
  (or (< (string-length string) 2)
      (and (char=? (string-ref string 0)
                   (string-ref string (- (string-length string) 1)))
           (palindrome1? (substring string 1 (- (string-length string) 1))))))


The higher performing algorithm is inspired by the idea of keeping two pointers to each end of a string and comparing the characters at the pointers. If the characters are the same, you move the pointers inward and when they meet, you have seen a palindrome. If at any point the characters differ, you don't have a palindrome:

(define (palindrome2? string)
  (define (scan front-pointer rear-pointer)
    (or (>= front-pointer rear-pointer)
        (and (char=? (string-ref string front-pointer)
                     (string-ref string rear-pointer))
             (scan (+ front-pointer 1) (- rear-pointer 1)))))
  (scan 0 (- (string-length string) 1)))

As you can see, these really aren't very different to start with. Both algorithms are iterative and both work their way in from the outside of the string. There are basically two differences. First, access to the rear of the string is either by a rear pointer, or by using the string-length of the string and subtracting 1. Second, the iterative call either uses substring or moves the pointers closer together.

First, let's assume that our processor can reference through an indexed offset. This would mean we could point at the element one beyond the rear-pointer and not incur overhead. This isn't an unreasonable assumption for a CISC architecture such as an x86, but it would probably cost one instruction of overhead on a RISC architecture. So the second algorithm becomes this:

(define (palindrome2? string)
  (define (scan front-pointer rear-pointer)
    (or (< (- rear-pointer front-pointer) 2)
        (and (char=? (string-ref string front-pointer)
                     (string-ref string (- rear-pointer 1)))
             (scan (+ front-pointer 1) (- rear-pointer 1)))))
  (scan 0 (string-length string)))


Now this next assumption is a bit more of a stretch. The implementation of palindrome1? uses substring on each iteration, and that's going to result in a lot of string copying. If our implementation used "slices" instead of copying the string, there would be a lot less copying going on:

(define (palindrome1? string)
  (or (< (- (slice-end string) (slice-start string)) 2)
      (and (char=? (string-ref string (slice-start string))
                   (string-ref string (- (slice-end string) 1)))
           (palindrome1? 
             (substring string (+ (slice-start string) 1) (- (slice-end string) 1))))))


It is not uncommon for a compiler to introduce internal procedures for looping, so we can do that.

(define (palindrome1? string)
  (define (scan slice)
    (or (< (- (slice-end slice) (slice-start slice)) 2)
        (and (char=? (slice-ref slice (slice-start slice))
                     (slice-ref slice (- (slice-end slice) 1)))
             (scan (subslice slice (+ (slice-start slice) 1) (- (slice-end slice) 1))))))
  (scan (make-slice string 0 (string-length string))))


We'll enter fantasy land again and let our compiler be smart enough to "spread" the slice data structure into the argument list of scan. This is no doubt asking too much from our compiler, but the information is available and it could in theory be done:

(define (palindrome1? string)
  (define (scan slice-start slice-end)
    (or (< (- slice-end slice-start) 2)
        (and (char=? (slice-ref string slice-start)
                     (slice-ref string (- slice-end 1)))
             (scan (+ slice-start 1) (- slice-end 1)))))
  (scan 0 (string-length string)))


And now we have palindrome2? (modulo renaming).

This doesn't really prove anything. But with a couple of somewhat unlikely compiler tricks, the naive version could be transformed into the more optimized version. It suggests that it would be surprising, but not a complete shock, for an ambitious compiler writer to attempt it.

I wish someone would write that Sufficiently Smart Compiler.

14 Jan 2020 12:09pm GMT