18 Sep 2014


Joe Marshall: A useful, if somewhat pointless, trick with homographic functions

In my previous posts I showed that if you are applying a homographic function to a continued fraction, you can partly evaluate the answer before you completely apply the function. Instead of representing homographic functions as lambda expressions, I'll represent them as a list of the coefficients a, b, c, and d in (lambda (t) (/ (+ (* a t) b) (+ (* c t) d))). I'll represent a simple continued fraction as a stream of the integer terms in the denominators.
Here is how you partly apply a homographic function to a continued fraction:

(define (partly-apply hf cf)
  (let ((a (first  hf))
        (b (second hf))
        (c (third  hf))
        (d (fourth hf)))
    (if (empty-stream? cf)
        (values (list a a
                      c c)
                cf)
        (let ((term (head cf)))
          (values (list (+ (* a term) b) a
                        (+ (* c term) d) c)
                  (tail cf))))))

Partly evaluating a homographic function involves looking at the limits of the function as t starts at 1 and goes to infinity:

(define (partly-evaluate hf)
  (let ((a (first hf))
        (b (second hf))
        (c (third hf))
        (d (fourth hf)))

    (if (and (same-sign? c (+ c d))
             (let ((limit1 (quotient      a       c))
                   (limit2 (quotient (+ a b) (+ c d))))
               (= limit1 limit2)))
        (let ((term (quotient a c)))
          (let ((new-c (- a (* c term)))
                (new-d (- b (* d term))))
            (values term (list c d new-c new-d))))
        (values #f #f))))

We can combine these two steps and make something useful. For example, we can print the value of applying a homographic function to a continued fraction incrementally, printing the most significant digits before computing further digits.

(define (print-hf-cf hf cf)
  (call-with-values (lambda () (partly-evaluate hf))
    (lambda (term hf*)
      (if (not term)
          (call-with-values (lambda () (partly-apply hf cf))
            print-hf-cf)
          (begin
            (display term) 
            ;; take reciprocal and multiply by 10
            (let ((a (first hf*))
                  (b (second hf*))
                  (c (third hf*))
                  (d (fourth hf*)))
              (print-hf-cf (list (* c 10) (* d 10)
                                 a        b)
                           cf)))))))

But how often are you going to apply a homographic function to a continued fraction? Fortunately, the identity function is homographic (its coefficients are 1 0 0 1), so applying it to a continued fraction doesn't change the value. The square root of 2 is a simple continued fraction whose terms are [1 2 2 2 ...], where the 2s repeat forever. We apply the identity homographic function to it and print the results:

(print-hf-cf (list 1 0 0 1) sqrt-two)
14142135623730950488016887242096980785696718^G
; Quit!

As you can see, we start printing the square root of two and we don't stop printing digits until the user interrupts.

A fancier version could truncate the output and print a decimal point after the first iteration.

18 Sep 2014 8:06pm GMT

17 Sep 2014


Zach Beane: Common Lisp bits

Heinrich Apfelmus has updated the source code from Computer Models of Musical Creativity and put it on github. Looks like it's meant to work with RMCL.

"CEPL is an extension for common lisp that makes working with OpenGL simple and familiar," according to Baggers. There is a blog and a number of videos about CEPL. The readme cautions: PRE-ALPHA.

"BG" gives a take on the history of Macintosh Common Lisp. Rainer Joswig responded to a number of points in the ensuing /r/lisp discussion.

3bgl-shader is "a Common Lisp DSL for generating GLSL shaders," by Bart Botta. Needs people to try it out and provide feedback.

Pseudo is a Lisp-powered roguelike multiplayer browser game, with AGPLv3-licensed source code available. Created by Matthew Carter.

The Infected is a roguelike survival horror game in Common Lisp, by Jan Tatham.

Mariano Montone writes about embedding Python syntax (and functionality) in Common Lisp sources.

17 Sep 2014 8:56pm GMT

14 Sep 2014


Quicklisp news: September 2014 Quicklisp dist update now available

New projects:

Updated projects: bknr-datastore, caveman, cl-ana, cl-async, cl-conspack, cl-css, cl-gendoc, cl-gss, cl-inflector, cl-oauth, cl-olefs, cl-quickcheck, cl-redis, cl-sdl2, cl-tld, clip, closer-mop, coleslaw, colleen, crane, crypto-shortcuts, function-cache, gbbopen, hermetic, hu.dwim.walker, let-over-lambda, lisp-unit2, lquery, mel-base, mexpr, mgl-pax, modularize, modularize-hooks, modularize-interfaces, mpc, open-vrp, pgloader, plump, policy-cond, protobuf, qmynd, repl-utilities, restas, scriptl, shelly, smug, software-evolution, south, staple, stumpwm, trivial-mimes, weblocks-tree-widget.

To get this update, use (ql:update-dist "quicklisp").

Just as a friendly reminder, Quickdocs is a great way to find libraries in Quicklisp. I don't run the site and it's not an official part of Quicklisp, it's just a great project that uses Quicklisp's metadata to build a really useful service. So check it out!

14 Sep 2014 9:29pm GMT

Timofei Shatrov: Who needs graph theory anyway?

In my last post I discussed how to make a Japanese->English transliterator and outlined some problems that limited its usefulness. One problem is that there's no obvious way to segment a sentence into words. I looked up existing solutions, and a lightweight Javascript implementation caught my eye. I quickly ported it to Common Lisp and, to the surprise of absolutely no one, the results were awful.

It was clear that I needed an actual database of Japanese words to do segmentation properly. This would also solve the "kanji problem" since this database would also include how to pronounce the words. My first hunch was Wiktionary, but its dump format turned out to be pretty inefficient for parsing.

Fortunately I quickly discovered a free JMDict database which was exactly what I needed. It even had open-source code in Python for parsing and loading its XML dumps. Naturally, I wrote my own code to parse it since its database schema looked too complex for my needs. But I'm not going to discuss that in this post, as it is quite boring.

Now that I had a comprehensive Postgres database of every word in the Japanese language (not really, as it doesn't include conjugations), it was only a matter of identifying the words in a sentence. To do this, we look up every substring of the sentence in the database for exact matches. There are n(n+1)/2 substrings in a string, so we aren't doing too badly in terms of performance (and the string won't be too long anyway, since prior to running this procedure I'll be splitting it by punctuation etc.)

(defstruct segment
  start end word)

(defun find-substring-words (str)
  (loop for start from 0 below (length str)
       nconcing 
       (loop for end from (1+ start) upto (length str)
          for substr = (subseq str start end)
            nconcing (mapcar 
                      (lambda (word)
                        (make-segment :start start :end end :word word))
                      (find-word substr)))))

The problem is that there are a lot of words, and many of them are spelled identically. I decided to assign a score to each word based on its length (longer is better), whether it's a preferred spelling of the word, how common the word is, and whether it's a particle (particles tend to be short and thus need a boost to increase their prominence).
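
A hypothetical sketch of what such a scoring function could look like; word-kana, word-preferred-p, word-frequency and word-particle-p are made-up accessors used purely for illustration, not the actual ones from my code:

(defun word-score (word)
  ;; Longer words score quadratically better; preferred spellings and
  ;; common words get a bonus; particles get a flat boost so they can
  ;; compete with longer words.  All the constants are arbitrary.
  (+ (expt (length (word-kana word)) 2)
     (if (word-preferred-p word) 10 0)
     (or (word-frequency word) 0)
     (if (word-particle-p word) 20 0)))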

Now we have the following problem: for a sentence, find the set of non-intersecting segments with the maximum total score. You might have better mathematical intuition than I do, but my first thought was:

This looks NP-hard, man. This problem has "travelling salesman" written all over it.

My first attempt to crack it was to calculate score per letter for each word and select words with the highest scores. But a counterexample comes to mind rather easily: in a sentence "ABC" with words "AB" (score=5), "BC" (score=5) and "ABC" (score=6), words "AB" and "BC" have a higher score per letter (2.5), but the optimal covering is provided by the word "ABC" with its score per letter a measly 2.

At this point I was working with the most convenient mathematical instrument, which is pen and paper. The breakthrough came when I started to consider a certain relation between two segments: the segment a can be followed by the segment b iff (segment-start b) is greater than or equal to (segment-end a). Under this relation our segments form a transitive directed acyclic graph. The proof is left as an exercise for the reader. Clearly we just need to do a transitive reduction and use something similar to Dijkstra's algorithm to find the path with the maximal score! This problem is clearly solvable in polynomial time!
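
To make the relation concrete, here it is as a predicate over the segment struct defined above (just an illustration; the actual search below inlines this check):

(defun segment-can-follow-p (a b)
  "True when segment B can appear after segment A, i.e. they don't overlap."
  (>= (segment-start b) (segment-end a)))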

[Image: actual notes drawn by me]

In reality the algorithm turns out to be quite simple. Since find-substring-words always returns segments sorted by their start and then by their end, every segment can only be followed by the segments after it. We can then accumulate the largest total score and the path used for it for every segment by using a nested loop:

(defstruct segment
  start end word (score nil) (accum 0) (path nil))

(defun find-best-path (segments)
  ;;assume segments are sorted by (start, end) (as is the result of find-substring-words)
  (let ((best-accum 0)
        (best-path nil))
    (loop for (seg1 . rest) on segments
       when (> (segment-score seg1) (segment-accum seg1))
         do (setf (segment-accum seg1) (segment-score seg1)
                  (segment-path seg1) (list seg1))
            (when (> (segment-accum seg1) best-accum)
              (setf best-accum (segment-accum seg1)
                    best-path (segment-path seg1)))
       when (> (segment-score seg1) 0)
         do (loop for seg2 in rest
               if (>= (segment-start seg2) (segment-end seg1))
               do (let ((accum (+ (segment-accum seg1) (segment-score seg2))))
                    (when (> accum (segment-accum seg2))
                      (setf (segment-accum seg2) accum
                            (segment-path seg2) (cons seg2 (segment-path seg1)))
                      (when (> accum best-accum)
                        (setf best-accum accum
                              best-path (segment-path seg2)))))))
    (values (nreverse best-path) best-accum)))

Of course when I actually tried to run this algorithm, SBCL just crashed. How could that be? It took me a while to figure out, but notice how segment-path contains a list that includes the segment itself. A recursive self-referential structure! When SBCL tried to print that in the REPL, it didn't result in dragons flying out of my nose but a crash still happened. Interestingly, Common Lisp has a solution to this: if *print-circle* is set to t, it will actually print the structure using referential tokens. Anyway, I just added the following before returning the result to remove self-references:

    (dolist (segment segments)
      (setf (segment-path segment) nil))
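
For reference, here is what *print-circle* does with a small self-referential structure (a standalone snippet, not part of the segmenter):

(let ((*print-circle* t))
  (let ((list (list 1 2 3)))
    (setf (cddr list) list)      ; the list now points back at itself
    (prin1-to-string list)))     ; => "#1=(1 2 . #1#)"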

So, did it work? Yes, it did, and the result was impressive! Even though my scoring system is pretty barebones, it's on par with or even better than Google Translate's romanization on a few test sentences I tried. I still need to add conjugations, and it can't do personal names at all, but considering how little code there is and the fact that it doesn't even attempt grammatical analysis of the sentence (due to me not knowing the language), I am very happy with the result. Also I plan to add a web interface so that it's possible to hover over words and see the translation. That would be pretty useful. The work-in-progress code is on my Github.

14 Sep 2014 12:41pm GMT

Paul Khuong: Doodle: Hybridising SBCL's GENCGC With Mark and Sweep

Meta-note: this is more of a journal entry than the usual post here. I'll experiment with the format and see if I like publishing such literal and figurative doodles.

Garbage collection is in the air. My friend Maxime is having issues with D's garbage collector, and Martin Cracauer has a large patch to improve SBCL's handling of conservative references. I started reviewing that patch today, and, after some discussion with Alastair Bridgewater, I feel like adding a mark-and-sweep component to SBCL's GC might be easier than what the patch does, while achieving its goal of reducing the impact of conservative references. That led to the whiteboarding episode below and a plan to replace the garbage collecting half of SBCL's generational GC. But first, a summary of the current implementation.

The present, and how we got here

CMUCL started out with a Cheney-style two-space collector. Two-space collectors free up space for more allocations by copying objects that might still be useful (that are reachable from "roots," e.g., variables on the stack) from the old space to the new space. Cheney's algorithm elegantly simplifies this task by storing bookkeeping information in the data itself. When we copy an object to the new space (because it is reachable), we want to make sure that all other references to that object are also replaced with references to the copy. Cheney's solution to that desideratum is obvious: overwrite the old object with a broken heart (forwarding pointer), a marker that

  1. the object has already been copied to the new space;
  2. the copy lives at address x.

This adds a constraint that heap-allocated objects can never be smaller than a broken heart, but they're usually one or two words (two in SBCL's case) so the constraint is rarely binding.

When the garbage collector traverses the roots (the stack, for example) and finds a pointer, the code only has to dereference that pointer to determine if the object it points to has been moved. If so, the GC replaces the root pointer with a pointer to the copy in the new space. Otherwise, the GC copies the object to the new space, repoints the root to that copy, and overwrites the old object with a broken heart.

We also need to traverse objects recursively: when we find that an object is live and copy it to the new space, we must also make sure that anything that object points to is also preserved, and that any pointer in that object is updated with pointers to copies in the new space.

That's a graph traversal, and the obvious implementation maintains a workset of objects to visit which, in the worst case, could include all the objects in old space. The good news is we don't have to worry about objects re-entering that workset: we always overwrite objects (in old space) with a broken heart when we visit them for the first time.

Cheney proposed a clever trick to implement this workset. Whenever an object enters the workset, it has just been copied to the new space; as long as we allocate in the new space by incrementing an allocation pointer, the new space itself can serve as the workset! In addition to the allocation pointer, we now need a "to-scan" pointer. Any object in the new space that's below the to-scan pointer has already been scanned for pointers and fixed to point in the new space; any object between the to-scan pointer and the allocation pointer must be scanned for pointers to the old space. We pop an element from the workset by looking at the next object (in terms of address) after the to-scan pointer and incrementing that pointer by the object's size. When the to-scan and the allocation pointers meet, the workset is empty and GC terminates.

Some SBCL platforms still use this two-space collector, but it doesn't scale very well to large heaps (throughput is usually fine, but we waste a lot of space and GC pauses can be long). The generational conservative garbage collector (GENCGC, GENGC on precise/non-conservative platforms) is a hack on top of that Cheney GC.

The GC is "generational" because most passes only collect garbage from a small fraction of the heap, and "conservative" because we have to deal with values that may or may not be pointers (e.g., we don't always know if the value in a register is a Lisp reference or just a machine integer) by considering some objects as live (not up for collection) while pinning them at their current address.

The runtime uses mprotect to record writes to the heap, except for the nursery (newly allocated objects) where we expect most writes to land. The heap is partitioned in pages, and the first write to a page after a GC triggers a protection fault; the signal handler marks that page as mutated and changes the protection to allow writes.

When a GC is triggered, we usually want to collect only the nursery, i.e., only objects that were allocated after the previous GC pass. GEN(C)GC adapts Cheney to this use case by building the set of all pages that might have been mutated to point somewhere in the nursery (thanks to the mprotect write barrier) and scanning them for roots, like the stack in Cheney GC. The default GENGC configuration has 7 generations and we extend this scheme by flagging pages with pointers to younger generations (newer objects), without noting what these generations might be.

Pinned objects are also handled by abusing the root set: pages that contain at least one pinned object don't undergo garbage collection and are directly scanned for pointers, like the stack in Cheney GC.

Instead of having two heaps, an old space and a new space, we now have a lot of pages, and each page belongs to a generation. When we want to collect a given generation, pages in that generation form the old space, and pages allocated during GC the new space. This means that we lose the simplicity of Cheney's new-space-is-the-workset trick: the new space isn't contiguous, so a single to-scan pointer doesn't cut it anymore! GENGC works around that by scanning the page table, but it's not pretty and I really don't know if Cheney is a good fit anymore.

Martin Cracauer's patch

GENCGC's approach to pinned objects is stupid. If a page has no reference except for one conservative pointer, the whole page is considered live and scanned for references.

Martin's solution is to allocate additional temporary metadata only for pinned pages and track the pinned status of individual objects. When the GC encounters a pointer to a page with pinned objects, it checks if it's a pointer to a pinned object. If so, the pointee is left in place. Otherwise, it's copied normally.

The patch has code to mark objects as live (pinned) and to overwrite objects once they have been copied. Basically, it is half of a mark-and-sweep garbage collector. The main difference is that the set of pinned objects doesn't grow (being pinned isn't a contagious property), so we don't need a worklist for pinned objects. However, I already noted that I'm not convinced the worklist hack in GENGC is a good idea.

A hybrid collector!

Instead of marking pages as containing pinned objects, I feel it may be simpler to collect some pages by copying, and others by marking. Any pinned page would have the "mark" GC policy, while pages that likely contain few live objects (e.g., the nursery and pages with a lot of unused memory) would be collected by copying. This too would avoid the issue of considering whole pages as live when pinned, and I think that having the choice of copying or marking at a page granularity will be simpler than toggling at the level of individual objects.

Each "mark" page now has two (bit)sets, one for live objects and another for live objects that have already been scanned. We can maintain a worklist at the page granularity with an embedded linked list: whenever a "mark" page gains a new live object and it's not already in the worklist, that page is enqueued for scanning.

Instead of emulating Cheney's trick by looking for newly allocated pages in our page table, we can add pages in new space to the worklist whenever they become full.

Finally, at the end of the pass, we traverse all "mark" pages and clear dead objects.

That's pretty simple (arguably simpler than the current implementation!), and shouldn't involve too many changes to the rest of the code. Mostly, I'd have to adapt the weak pointer machinery to stop assuming that it can use forwarding pointers to determine when objects have been marked as live.

However, we might lose the ability to run medium GCs, to collect more than the nursery but less than the whole heap. If we only want to GC the nursery, the mprotect write barrier gives us all the information we need to find references from the rest of the heap to the nursery. If we wish to collect the whole heap, we only have to consider stacks and some statically allocated space as roots.

For medium GCs, e.g., collecting only generations 1-4 out of 7, GENGC exploits the way that garbage collection (almost) always copies to easily track pages with pointers to younger generations. It's coarse, but usually acceptable thanks to the copying. I don't know that it would work as well if the default is to only copy the nursery. Moreover, if we have a hybrid GC, it probably makes sense to focus copying on pages that are mostly empty, regardless of their age. If we do want medium GCs, we might have to track, for each page, the set of pages that point there. This set can include false positives, so it's probably easiest to clear it before major GCs, and otherwise only add to that set (removing pages that were emptied by a GC pass sounds reasonable). I also expect that some pages will have many referrers; I'm thinking we might use a distinguished value to mean "referred to by every page" and not consider them for medium GC.

What's next

Martin's patch clearly addresses an important weakness in SBCL's garbage collector. If I can't make good progress on the hybrid GC soon, I'll make sure the patch is cleaned up for master, hopefully by Thanksgiving.

14 Sep 2014 2:56am GMT

12 Sep 2014


Clozure CL Blog: Clozure CL 1.10 is available

Clozure CL 1.10 is now available. See http://ccl.clozure.com/download.html for instructions on how to get it.

12 Sep 2014 3:45pm GMT

08 Sep 2014


Christophe Rhodes: naive vs proper code-walking

I said in my discussion about backquote representations that some utilities had defects made manifest by SBCL 1.2.2's new internal representation for backquote and related operators, and that those defects could have been avoided by using a code-walker. I'm going to look at let-over-lambda code here, to try to demonstrate what I meant by that, and show how a proper code-walker can quite straightforwardly be used for the code transformations that have been implemented using a naïve walker (typically walking over a tree of conses), removing whole classes of defects in the process.

The let-over-lambda code I'm discussing is from https://github.com/thephoeron/let-over-lambda, specifically this version. This isn't intended to be a hatchet job on the utility - clearly, it is of use to its users - but to show up potential problems and offer solutions for how to fix them. I should also state up front that I haven't read the Let over Lambda book, but it's entirely possible that discussing and using a full code-walker would have been out of scope (as it explicitly was for On Lisp).

Firstly, let's deal with how the maintainer of the let-over-lambda code is dealing with the change in backquote representations, since it's still topical:

;; package definition here just in case someone decides to paste
;; things into a Lisp session, and for private namespacing
(defpackage "LOL" (:use "CL"))
(in-package "LOL")
;; actual excerpts from let-over-lambda code from
;; <https://github.com/thephoeron/let-over-lambda/blob/a202167629cb421cbc2139cfce1db22a84278f9f/let-over-lambda.lisp>
;; begins here:
#+sbcl
(if (string-lessp (lisp-implementation-version) "1.2.2")
    (pushnew :safe-sbcl *features*)
    (setq *features* (remove :safe-sbcl *features*)))
(defun flatten (x)
  (labels ((rec (x acc)
             (cond ((null x) acc)
                   #+(and sbcl (not safe-sbcl))
                   ((typep x 'sb-impl::comma) (rec (sb-impl::comma-expr x) acc))
                   ((atom x) (cons x acc))
                   (t (rec (car x) (rec (cdr x) acc))))))
    (rec x nil)))

The issues around the (*features*) handling here have been reported at github; for the purpose of this blog entry, I will just say that I wrote about them in Maintaining Portable Lisp Programs, a long time ago, and that a better version might look a bit like this:

#+sbcl
(eval-when (:compile-toplevel :execute)
  (defun comma-implementation ()
    (typecase '`,x
      (symbol 'old)
      ((cons symbol (cons structure-object)) 'new)))
  (if (eql (comma-implementation) 'old)
      (pushnew 'cons-walkable-backquote *features*)
      (setq *features* (remove 'cons-walkable-backquote *features*))))
(defun flatten (x)
  (labels ((rec (x acc)
             (cond ((null x) acc)
                   #+lol::cons-walkable-backquote
                   ((typep x 'sb-impl::comma) (rec (sb-impl::comma-expr x) acc))
                   ((atom x) (cons x acc))
                   (t (rec (car x) (rec (cdr x) acc))))))
    (rec x nil)))

With these changes, the code is (relatively) robustly testing for the particular feature it needs to know about at the time that it needs to know, and recording it in a way that doesn't risk confusion or contention with any other body of code. What is the let-over-lambda library using flatten for?

(defun g!-symbol-p (thing)
  (and (symbolp thing)
       (eql (mismatch (symbol-name thing) "G!") 2)))
(defmacro defmacro/g! (name args &rest body)
  (let ((syms (remove-duplicates
               (remove-if-not #'g!-symbol-p (flatten body)))))
    `(defmacro ,name ,args
       (let ,(mapcar
              (lambda (s)
                `(,s (gensym ,(subseq (symbol-name s) 2))))
              syms)
         ,@body))))

The intent behind this macro-defining macro, defmacro/g!, appears to be automatic gensym generation: being able to write

(defmacro/g! with-foo ((foo) &body body)
  `(let ((,g!foo (activate-foo ,foo)))
     (unwind-protect
         (progn ,@body)
       (deactivate-foo ,g!foo))))

without any explicit calls to gensym but retaining the protection that gensyms give against name capture:

(macroexpand-1 '(with-foo (3) 4))
; => (let ((#1=#:FOO1 (activate-foo 3)))
;      (unwind-protect
;          (progn 4)
;        (deactivate-foo #1#)))

That's fine; it's reasonable to want something like this. Are there any issues with this, apart from the one exposed by SBCL's new backquote implementation? In its conventional use, probably not - essentially, all uses of g! symbols are unquoted (i.e. behind commas) - but there are a couple of more theoretical points. One issue is that flatten as it currently stands will look for all symbols beginning with g! in the macroexpander function source, whether or not they are actually variable evaluations:

(defmacro/g! with-bar ((bar) &body body)
  `(block g!block
     (let ((,g!bar ,bar)) ,@body)))
; unused variable G!BLOCK
(macroexpand-1 '(with-bar (3) 4))
; => (block g!block (let ((#:BAR1 3)) 4))

In this example, that's fair enough: it's probably user error to have those g! symbols not be unquoted; this probably only becomes a real problem if there are macro-defining macros, with both the definer and the definition using g! symbols. It's not totally straightforward to demonstrate other problems with this simple approach to Lisp code transformation using just this macro; the transformation is sufficiently minimal, and the symptoms of problems relatively innocuous, that existing programming conventions are strong enough to prevent anything seriously untoward going wrong.

Before getting on to another example where the problems with this approach become more apparent, how could this transformation be done properly? By "properly" here I mean that the defmacro/g! should arrange to bind gensyms only for those g! symbols which are to be evaluated by the macroexpander, and not for those which are used for any other purpose. This is a task for a code-walker: a piece of code which exploits the fact that Lisp code is made up of Lisp data structures, all of which are introspectable, and the semantics of which in terms of effect on environment and execution are known. It is tedious, though possible, to write a mostly-portable code-walker (there needs to be some hook into the implementation's representation of environments); I'm not going to do that here, but instead will use SBCL's built-in code-walker.

The sb-walker:walk-form function takes three arguments: a form to walk, an initial environment to walk it in, and a walker function to perform whatever action is necessary on the walk. That walker function itself takes three arguments, a form, context and environment, and the walker arranges for it to be called on every macroexpanded or evaluated subform in the original form. The walker function should return a replacement form for the subform it is given (or the subform itself if it doesn't want to take any action), and a secondary value of t if no further walking of that form should take place.
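
As a minimal illustration of that interface (this snippet is mine and not part of the macros below), here is a walker function that simply prints every subform it is offered and returns it unchanged, so the walk proceeds normally:

(sb-walker:walk-form
 '(let ((x 1)) (+ x 2))
 nil
 (lambda (form context env)
   (declare (ignore context env))
   (print form) ; observe each evaluated or macroexpanded subform
   form))       ; return it unchanged and keep walking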

To do g! symbol detection and binding is fairly straightforward. If a symbol is in a context for evaluation, we collect it, and here we can take the first benefit from a proper code walk: we only collect g! symbols if the code-walker deems that they will be evaluated and there isn't an already-existing lexical binding for it:

(defmacro defmacro/g!-walked (name args &body body)
  (let* (g!symbols)
    (flet ((g!-walker (subform context env)
             (declare (ignore context))
             (typecase subform
               (symbol
                (when (and (g!-symbol-p subform)
                           (not (sb-walker:var-lexical-p subform env)))
                  (pushnew subform g!symbols))
                subform)
               (t subform))))
      (sb-walker:walk-form `(progn ,@body) nil #'g!-walker)
      `(defmacro ,name ,args
         (let ,(mapcar (lambda (s) (list s `(gensym ,(subseq (symbol-name s) 2))))
                       g!symbols)
           ,@body)))))

The fact that we only collect symbols which will be evaluated deals with the problem exhibited by with-bar, above:

(defmacro/g!-walked with-bar/walked ((bar) &body body)
  `(block g!block
     (let ((,g!bar ,bar)) ,@body)))
(macroexpand-1 '(with-bar/walked (3) 4))
; => (block g!block (let ((#:BAR1 3)) 4))

Only gathering symbols which don't have lexical bindings (testing sb-walker:var-lexical-p) deals with another minor problem:

(defmacro/g!-walked with-baz ((baz) &body body)
  (let ((g!sym 'sym))
    `(let ((,g!sym ,baz)) ,@body)))
(macroexpand-1 '(with-baz (3) 4))
; => (let ((sym 3)) 4)

(the cons-walker - flatten - would not be able to detect that there is already a binding for g!sym, and would introduce another one, again leading to an unused variable warning.)

OK, time to recap. So far, we've corrected the code that tests for particular backquote implementations, which was used in flatten, which itself was used to perform a code-walk; we've also seen some low-impact or theoretical problems with that simple code-walking technique, and have used a proper code-walker instead of flatten to deal with those problems. If the odd extra unused variable binding were the worst thing that could happen, there wouldn't be much benefit from using a code-walker (other than the assurance that the walker is dealing with forms for execution); however, let us now turn our attention to the other macro in let-over-lambda's code which does significant codewalking:

(defun dollar-symbol-p (thing)
  (and (symbolp thing)
       (char= (char (symbol-name thing) 0) #\$)
       (ignore-errors (parse-integer (subseq (symbol-name thing) 1)))))
(defun prune-if-match-bodies-from-sub-lexical-scope (tree)
  (if (consp tree)
      (if (or (eq (car tree) 'if-match)
              (eq (car tree) 'when-match))
          (cddr tree)
          (cons (prune-if-match-bodies-from-sub-lexical-scope (car tree))
                (prune-if-match-bodies-from-sub-lexical-scope (cdr tree))))
      tree))
;; WARNING: Not %100 correct. Removes forms like (... if-match ...) from the
;; sub-lexical scope even though this isn't an invocation of the macro.
#+cl-ppcre
(defmacro! if-match ((test str) conseq &optional altern)
  (let ((dollars (remove-duplicates
                  (remove-if-not #'dollar-symbol-p
                                 (flatten (prune-if-match-bodies-from-sub-lexical-scope conseq))))))
    (let ((top (or (car (sort (mapcar #'dollar-symbol-p dollars) #'>)) 0)))
      `(let ((,g!str ,str))
         (multiple-value-bind (,g!s ,g!e ,g!ms ,g!me) (,test ,g!str)
           (declare (ignorable ,g!e ,g!me))
           (if ,g!s
               (if (< (length ,g!ms) ,top)
                   (error "ifmatch: too few matches")
                   ;; lightly edited here to remove irrelevant use of #`
                   (let ,(mapcar (lambda (a1) `(,(symb "$" a1)
                                                (subseq ,g!str (aref ,g!ms ,(1- a1))
                                                               (aref ,g!me ,(1- a1)))))
                                 (loop for i from 1 to top collect i))
                     ,conseq))
               ,altern))))))
(defmacro when-match ((test str) conseq &rest more-conseq)
  `(if-match (,test ,str)
     (progn ,conseq ,@more-conseq)))

What's going on here? We have a prune-if-match-bodies-from-sub-lexical-scope function which, again, performs some kind of cons-based tree walk, removing some conses whose car is if-match or when-match. We have a trivial macro when-match which transforms into an if-match; the if-match macro is more involved. Any symbols named as a $ sign followed by an integer (in base 10) are treated specially; the intent is that they will be bound to capture groups of the cl-ppcre match. So it would be used in something like

(defun key-value (line)
  (if-match ((lambda (s) (scan "^\\(.*\\): \\(.*\\)$" s)) line)
      (list $1 $2)
      (error "not actually a key-value line: ~S" line)))

and that would macroexpand to, roughly,

(defun key-value (line)
  (multiple-value-bind (s e ms me)
      ((lambda (s) (scan "^\\(.*\\): \\(.*\\)$" s)) line)
    (if s
        (if (< (length ms) 2)
            (error "if-match: not enough matches)
            (let (($1 (subseq line (aref ms 0) (aref me 0)))
                  ($2 (subseq line (aref ms 1) (aref me 1))))
              (list $1 $2)))
        (error "not actually a key-value line: ~S" line))))

(there's additional reader macrology in let-over-lambda to make that lambda form unnecessary, but we can ignore that for our purposes).

Now, if-match has a similar problem that defmacro/g! had: since the tree walker doesn't make a distinction between symbols present for evaluation and symbols for any other purpose, it is possible to confuse the walker. For example:

(if-match (scanner string)
    (if (> (length $1) 6)
        '|$1000000|
        'less-than-$1000000))

This form, if macroexpanded, will attempt to bind one million variables to matched groups; even if the compiler doesn't choke on that, evaluation will go wrong, as the matcher is unlikely to match one million groups (so the "not enough matches" error branch will be taken) - whereas of course the quoted one million dollar symbol is not intended for evaluation.

But the nesting problems are more obvious in this case than for defmacro/g!. Firstly, take the simple case:

(if-match (scanner string)
    (list $1
          (if-match (scanner2 string)
              $2
              nil))
    nil)

Here, the $2 is in the scope of the inner if-match, and so mustn't be included for the macroexpansion of the outer if-match. This case is handled in let-over-lambda's implementation by the prune-if-match-bodies-from-sub-lexical-scope: the consequent of the inner if-match is pruned from the dollar-symbol accumulator. However, there are several issues with this; the first is that the test is pruned:

(if-match (scanner string)
    (if-match (scanner2 $2)
        $1
        nil)
    nil)

In this example, the $2 is 'invisible' to the outer if-match, and so won't get a binding. That's straightforwardly fixable, along with the mishandling of when-match's syntax (the entire body of when-match should be pruned, not just the first form), and what I think is an error in the pruning of if-match (it should recurse on the cdddr, not the cddr; github issue).

Not fixable at all while still using naïve code-walking are two other problems, one of which is noted in the comment present in the let-over-lambda code: the pruner doesn't distinguish between if-match forms for evaluation and other conses whose car is if-match. Triggering this problem does involve some contortions - in order for it to matter, we need an if-match not for evaluation followed by a dollar symbol which is to be evaluated; but, for example:

(defmacro list$/q (&rest args)
  `(list ,@(mapcar (lambda (x) (if (dollar-symbol-p x) x `',x)) args)))
(if-match (scanner string)
    (list$/q foo if-match $2)
    nil)

Here, although the $2 is in a position for evaluation (after macroexpansion), it will have no binding because it will have been pruned when naïvely walking the outer if-match macro. The if-match symbol argument to list$/q ends up quoted, and should not be treated as a macro call.

Also, the pruner function must have special knowledge not just about the semantics of if-match, but also of any macro which can expand to if-match - see the attempt to handle when-match in the pruner. If a user were to have the temerity to define case-match

(defmacro case-match (string &rest clauses)
  (if (null clauses)
      nil
      `(if-match (,(caar clauses) ,string)
           (progn ,@(cdar clauses))
           (case-match string ,@(cdr clauses)))))

any attempt to nest a case-match inside an outer if-match is liable to fail, as the pruner has no knowledge of how to handle the case-match form.

All of these problems are solvable by using a proper code-walker. The code-walker should collect up all dollar symbols to be evaluated in the consequent of an if-match form, so that bindings for them can be generated, except for those with already existing lexical bindings within the if-match (not those from outside, otherwise nesting won't work). For testing purposes, we'll also signal a diagnostic condition within the macroexpander to indicate which dollar symbols we've found.

(define-condition if-match/walked-diagnostic (condition)
  ((symbols :initarg :symbols :reader if-match-symbols)))
(defmacro if-match/walked ((test string) consequent &optional alternative)
  (let* (dollar-symbols)
    (flet ((dollar-walker (subform context env)
             (declare (ignore context))
             (typecase subform
               (symbol
                (when (and (dollar-symbol-p subform)
                           (not (sb-walker:var-lexical-p subform env)))
                  (pushnew subform dollar-symbols))
                subform)
               (t subform))))
      (handler-bind ((if-match/walked-diagnostic #'continue))
        (sb-walker:walk-form consequent nil #'dollar-walker))
      (let* ((dollar-symbols (sort dollar-symbols #'> :key #'dollar-symbol-p))
             (top (dollar-symbol-p (car dollar-symbols))))
        (with-simple-restart (continue "Ignore diagnostic condition")
          (signal 'if-match/walked-diagnostic :symbols dollar-symbols))
        (sb-int:with-unique-names (start end match-start match-end)
          (sb-int:once-only ((string string))
            `(multiple-value-bind (,start ,end ,match-start ,match-end)
                 (,test ,string)
               (declare (ignore ,end) (ignorable ,match-end))
               (if ,start
                   (if (< (length ,match-start) ,top)
                       (error "~S: too few matches: needed ~D, got ~D." 'if-match
                              ,top (length ,match-start))
                       (let ,(mapcar (lambda (s)
                                       (let ((i (1- (dollar-symbol-p s))))
                                         `(,s (subseq ,string (aref ,match-start ,i) (aref ,match-end ,i)))))
                                     (reverse dollar-symbols))
                         ,consequent))
                   ,alternative))))))))

(I'm using sb-int:once-only and sb-int:with-unique-names to avoid having to include their definitions in this post, which is getting a bit lengthy). Testing this looks like

(defmacro test-if-match (form expected-symbols)
  `(handler-case (macroexpand-1 ',form)
     (if-match/walked-diagnostic (c)
       (assert (equal (if-match-symbols c) ',expected-symbols)))
     (:no-error (&rest values) (declare (ignore values)) (error "no diagnostic"))))
(test-if-match (if-match/walked (test string) (list $1 $2) 'foo) ($2 $1))
(test-if-match (if-match/walked (test string) (if (> (length $1) 6) '$10 '$8) nil) ($1))
(test-if-match (if-match/walked (scanner string)
                   (list $1
                         (if-match/walked (scanner2 string)
                             $2
                             nil))
                   nil)
               ($1))
(test-if-match (if-match/walked (scanner string) (list$/q foo if-match/walked $3) nil) ($3))
(defmacro case-match/walked (string &rest clauses)
  (if (null clauses)
      nil
      `(if-match/walked (,(caar clauses) ,string)
           (progn ,@(cdar clauses))
           (case-match/walked string ,@(cdr clauses)))))
(test-if-match (if-match/walked (scanner string)
                   (case-match/walked $1
                     (foo $2)
                     (bar $3)))
               ($1))

To summarize: I've shown here how to make use of a full code-walker to make a couple of code transforming macros more robust. Full code-walkers can do more than just what I've shown here: the sb-walker:walk-form interface can also inhibit macroexpansion, transform function calls into calls to other functions, while respecting the semantics of the Lisp operators in the code that is being walked and allowing some introspection of the lexical environment. Here, we have called sb-walker:walk-form for side effects from the walker function we've provided; it is also possible to use its value (that's how sb-cltl2:macroexpand-all is implemented, for example). I hope that this can help users affected by the change in internal representation of backquote, as well as others who want to write advanced code-transforming macros. If the thought of using an SBCL-internal code-walker makes you a bit queasy (as well it might), you could instead start by looking at one or two other more explicitly-portable code-walkers out there, for example John Fremlin's macroexpand-dammit, the walker in Alex Plotnick's CLWEB literate programming system (github link), or the code walker in iterate.

08 Sep 2014 8:13pm GMT

07 Sep 2014


Pascal Costanza: "Why I like Common Lisp"

In a recent email exchange, Charlotte Herzeel gave a summary of Common Lisp that I believe is worth repeating publicly. With her permission, I repeat her statements here.

"An important reason why I like Common Lisp a lot is that the language has a layered design that supports incremental development. The language provides very high-level programming abstractions, such as object-oriented programming, dynamic multiple dispatch, garbage collection, a meta-object protocol, and so on. These abstractions are typically open implementations, built on top of more efficient low-level abstractions the user can also choose to access directly.

Common Lisp is typically implemented as a compiled language, compiling directly to machine code. The runtime components are sparse, the garbage collector being an important one. Common Lisp provides the means to steer the compiler and runtime components to do low-level optimizations. Examples of this include: type declarations to remove type-checking at runtime; inline declarations to avoid dispatch; dynamic extent declarations to perform stack allocation instead of heap allocation; disassembly of code snippets; tuning of the garbage collector to switch between collection strategies; and so on. Optimizations such as these are optional and localized. Hence it is very easy in Common Lisp to rapidly prototype and then incrementally optimize the code by identifying the hotspots through profiling. This way you can often be as efficient as with C code, without being forced to program in a low-level style from the start throughout your whole program.
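
To illustrate the kind of localized, optional declarations Charlotte mentions, here is a small sketch of mine (not part of her email): a single hot function given type declarations and an aggressive optimization policy, which can then be inspected with disassemble.

(defun dot-product (xs ys)
  ;; Local, optional optimization: declare types and raise SPEED for
  ;; just this function, leaving the rest of the program untouched.
  (declare (type (simple-array double-float (*)) xs ys)
           (optimize (speed 3) (safety 1)))
  (let ((sum 0d0))
    (declare (type double-float sum))
    (dotimes (i (length xs) sum)
      (incf sum (* (aref xs i) (aref ys i))))))

;; (disassemble 'dot-product) then shows the machine code the compiler produced.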

Hence in contrast to C/C++, Common Lisp allows you to optimize code incrementally and locally for a particular snippet of code. In contrast to Java - or any other language with an implementation that performs optimization at runtime through tracing or JIT compiling or so - Common Lisp implementations employ in a sense a more classic compilation approach. In this sense, Common Lisp makes it easier to 'control' what you are measuring when profiling programs.

The Common Lisp Object System (CLOS) is a library in Common Lisp for object-oriented programming. Common Lisp is a multi-paradigm language, so it depends on your problem whether it is a good idea to use object-oriented programming or not. That said, CLOS is very different from mainstream object-oriented programming. It allows multiple inheritance, multiple dispatch, and is based on generic functions, i.e. classes define types, and methods are defined separately as part of generic functions. The CLOS implementation performs a lot of clever optimizations at runtime, for example for method lookup. What is of course special about CLOS, is that it has a meta-object protocol, which allows you to extend/modify CLOS in an organized way. For example, you have hooks into the method dispatch protocol, the slot (= field) access protocol, etc. If you want to know more about the CLOS implementation and the meta-object protocol, read 'The Art of the Meta-Object Protocol' by Kiczales, des Rivieres, Bobrow.

Common Lisp just has a lot of advanced language features that you just don't find in other languages.

From a practical point of view, I can recommend LispWorks as a Common Lisp implementation. LispWorks is very user-friendly because it comes with an integrated development environment. This means you get Smalltalk-like features such as code browsers and inspector tools. Another user-friendly implementation that is free is Clozure Common Lisp. The most widely used open-source implementation is SBCL, which is very stable and very efficient. There are lots of other Common Lisp implementations out there, but I recommend one of these three.

If you want to learn about Common Lisp, I can recommend "Ansi Common Lisp" by Graham. Maybe also interesting: 'Pascal Costanza's highly opinionated guide to Common Lisp' ;-). If you want a funny introduction to Common Lisp, check out the Lisperati. A good place to snoop for Common Lisp war stories is Planet Lisp. If you want to get an idea about libraries, see quicklisp."

07 Sep 2014 3:21pm GMT

Timofei Shatrov: My little transliterator can't be this CLOS

If you are reading this blog, you are probably able to read Latin script. It is pretty widespread in the world, and used by 70% of the world's population according to Wikipedia. Perhaps, like me, your native language uses a different script. There are many writing systems in the world, some are related, and some are wildly different from each other. Fortunately with the advent of the Internet and tools like Google Translate it is increasingly possible to read text not only in the language you don't understand, but even the languages where you don't even understand their writing system.

Well, Google is Google, but is it possible for a mere mortal to create something like that? Not to translate, but just to present some unknown writing system in your preferred alphabet (the process is called transliteration or transcription)? There's no reason why not.

In this post I'll talk about the process of romanization of Japanese language, which is transcription from Japanese to Latin script. For example "ありがとうございます" is romanized to "arigatō gozaimasu" under Hepburn romanization method (there are many of those).

First off, the basics of Japanese writing are as follows:

  1. There are several scripts used to write in Japanese language.
  2. Hiragana is a syllabary (a writing system where each character represents a syllable) that is used for words of Japanese origin.
  3. Katakana is another syllabary that is used for loan words. Every possible syllable in the Japanese language has a hiragana and a katakana form, and the two are usually completely different. Both scripts have about 50 characters in them.
  4. Chinese characters (kanji) are used for words of Japanese and Chinese origin. There are thousands of such characters. Furthermore, most of them could be read in several different ways, which makes transcribing them difficult. We're going to ignore those for now.


If we focus on romanization of hiragana and katakana (both systems are called kana for short), then the process seems pretty simple. It's just a matter of replacing each kana with the syllable it represents, written in roman letters. However, there are some characters that do not represent a syllable, but rather modify a syllable before or after them. This includes the sokuon, which doubles the consonant of the following syllable, and the yoon characters, which are small versions of normal kana used to modify the vowel of the preceding syllable.

Ok, so the first thing we must do is to somehow bring order to this madness. Since there is a hiragana and a katakana version of each character, it doesn't make sense to work with the characters directly. Instead I'm going to replace each character with a keyword.

(defparameter *sokuon-characters* '(:sokuon "っッ"))

(defparameter *iteration-characters* '(:iter "ゝヽ" :iter-v "ゞヾ"))

(defparameter *modifier-characters* '(:+a "ぁァ" :+i "ぃィ" :+u "ぅゥ" :+e "ぇェ" :+o "ぉォ"
                                      :+ya "ゃャ" :+yu "ゅュ" :+yo "ょョ"
                                      :long-vowel "ー"))

(defparameter *kana-characters*
  '(:a "あア"     :i "いイ"     :u "うウ"     :e "えエ"     :o "おオ"
    :ka "かカ"    :ki "きキ"    :ku "くク"    :ke "けケ"    :ko "こコ"
    :sa "さサ"    :shi "しシ"   :su "すス"    :se "せセ"    :so "そソ"
    :ta "たタ"    :chi "ちチ"   :tsu "つツ"   :te "てテ"    :to "とト"
    :na "なナ"    :ni "にニ"    :nu "ぬヌ"    :ne "ねネ"    :no "のノ"
    :ha "は" :hha "ハ" :hi "ひヒ" :fu "ふフ"  :he "へヘ"    :ho "ほホ"
    :ma "まマ"    :mi "みミ"    :mu "むム"    :me "めメ"    :mo "もモ"
    :ya "やヤ"                  :yu "ゆユ"                 :yo "よヨ"
    :ra "らラ"    :ri "りリ"    :ru "るル"    :re "れレ"    :ro "ろロ"
    :wa "わワ"    :wi "ゐヰ"                 :we "ゑヱ"    :wo "を" :wwo "ヲ"
    :n "んン"

    :ga "がガ"    :gi "ぎギ"    :gu "ぐグ"    :ge "げゲ"    :go "ごゴ"
    :za "ざザ"    :ji "じジ"    :zu "ずズ"    :ze "ぜゼ"    :zo "ぞゾ"
    :da "だダ"    :dji "ぢヂ"   :dzu "づヅ"   :de "でデ"    :do "どド"
    :ba "ばバ"    :bi "びビ"    :bu "ぶブ"    :be "べベ"    :bo "ぼボ"
    :pa "ぱパ"    :pi "ぴピ"    :pu "ぷプ"    :pe "ぺペ"    :po "ぽポ"
    ))

(defparameter *all-characters* (append *sokuon-characters*
                                       *iteration-characters*
                                       *modifier-characters*
                                       *kana-characters*))

(defparameter *char-class-hash*
  (let ((hash (make-hash-table)))
    (loop for (class chars) on *all-characters* by #'cddr
         do (loop for char across chars
               do (setf (gethash char hash) class)))
    hash))

(defun get-character-classes (word)
  (map 'list (lambda (char) (gethash char *char-class-hash* char)) word))

This creates a hash table that maps every kana to a keyword that describes it and we can now trivially convert a word into a list of "character classes" (or the characters themselves for non-kana characters). Then we need to transform this list into a kind of AST where modifier characters have the role of functions.

(defun process-modifiers (cc-list)
  (loop with result
       for (cc . rest) on cc-list
       if (eql cc :sokuon)
         do (push (cons cc (process-modifiers rest)) result) (loop-finish)
       else if (member cc *modifier-characters*)
         do (push (list cc (pop result)) result)
       else do (push cc result)
       finally (return (nreverse result))))
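
Tracing these two functions by hand on a small string, I'd expect something like this (purely illustrative):

(get-character-classes "きゃっと")
;; => (:KI :+YA :SOKUON :TO)

(process-modifiers '(:ki :+ya :sokuon :to))
;; => ((:+YA :KI) (:SOKUON :TO))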

This is your basic push/nreverse idiom with some extra recursiveness added. Sokuon is applied to everything to the right of it, because I wanted it to have lower precedence, i.e. (:sokuon :ka :+yu) is parsed as (:sokuon (:+yu :ka)) instead of the other way around. Now we can write the outline of our algorithm:

(defun romanize-core (method cc-tree)
  (with-output-to-string (out)
    (dolist (item cc-tree)
      (cond ((null item)) 
            ((characterp item) (princ item out))
            ((atom item) (princ (r-base method item) out))
            ((listp item) (princ (r-apply (car item) method (cdr item)) out))))))

The functions r-base and r-apply are generic functions that will depend on the method of romanization. Another generic function will be r-simplify that will "pretty up" the result. It is easy to write some reasonable fallback methods for them:

(defgeneric r-base (method item)
  (:documentation "Process atomic char class")
  (:method (method item)
    (string-downcase item)))

(defgeneric r-apply (modifier method cc-tree)
  (:documentation "Apply modifier to something")
  (:method ((modifier (eql :sokuon)) method cc-tree)
    (let ((inner (romanize-core method cc-tree)))
      (if (zerop (length inner)) inner
          (format nil "~a~a" (char inner 0) inner))))
  (:method ((modifier (eql :long-vowel)) method cc-tree)
    (romanize-core method cc-tree))
  (:method ((modifier symbol) method cc-tree)
    (format nil "~a~a" (romanize-core method cc-tree) (string-downcase modifier))))
    
(defgeneric r-simplify (method str)
  (:documentation "Simplify the result of transliteration")
  (:method (method str) str))

Of course relying on symbol names isn't flexible at all. It's better to have a mapping from each keyword to a string that represents it. This is where we have to resort to classes to store this mapping in a slot.

(defclass generic-romanization ()
  ((kana-table :reader kana-table
               :initform (make-hash-table))))

(defmethod r-base ((method generic-romanization) item)
  (or (gethash item (kana-table method)) (call-next-method)))

(defmethod r-apply ((modifier symbol) (method generic-romanization) cc-tree)
  (let ((yoon (gethash modifier (kana-table method))))
    (if yoon
        (let ((inner (romanize-core method cc-tree)))
          (format nil "~a~a" (subseq inner 0 (max 0 (1- (length inner)))) yoon))
        (call-next-method))))

(defmacro hash-from-list (var list)
  (alexandria:with-gensyms (hash key val)
    `(defparameter ,var
       (let ((,hash (make-hash-table)))
         (loop for (,key ,val) on ,list by #'cddr ; step over the plist two entries at a time
               do (setf (gethash ,key ,hash) ,val))
         ,hash))))

(hash-from-list *hepburn-kana-table*
  '(:a "a"    :i "i"     :u "u"     :e "e"   :o "o"
    :ka "ka"  :ki "ki"   :ku "ku"   :ke "ke" :ko "ko"
    :sa "sa"  :shi "shi" :su "su"   :se "se" :so "so"
    :ta "ta"  :chi "chi" :tsu "tsu" :te "te" :to "to"
    :na "na"  :ni "ni"   :nu "nu"   :ne "ne" :no "no"
    :ha "ha"  :hha "ha"  :hi "hi"   :fu "fu" :he "he" :ho "ho"
    :ma "ma"  :mi "mi"   :mu "mu"   :me "me" :mo "mo"
    :ya "ya"  :yu "yu"   :yo "yo"
    :ra "ra"  :ri "ri"   :ru "ru"   :re "re" :ro "ro"
    :wa "wa"  :wi "wi"   :we "we"   :wo "wo" :wwo "wo"
    :n "n"
    :ga "ga"  :gi "gi"   :gu "gu"   :ge "ge" :go "go"
    :za "za"  :ji "ji"   :zu "zu"   :ze "ze" :zo "zo"
    :da "da"  :dji "ji"  :dzu "zu"  :de "de" :do "do"
    :ba "ba"  :bi "bi"   :bu "bu"   :be "be" :bo "bo"
    :pa "pa"  :pi "pi"   :pu "pu"   :pe "pe" :po "po"
    :+a "a"   :+i "i"    :+u "u"    :+e "e"  :+o "o"
    :+ya "ya" :+yu "yu"  :+yo "yo"))

(defclass generic-hepburn (generic-romanization)
  ((kana-table :initform (alexandria:copy-hash-table *hepburn-kana-table*))))

I'm going for a rather versatile class hierarchy here, starting with a completely empty kana-table for the generic-romanization method, but defining methods on it that will work for any table. Then I define a class generic-hepburn that will be the basis for different Hepburn variations. The table is taken from the Wikipedia article on Hepburn romanization, which is pretty detailed. By carefully reading it, we can identify the exceptions that the above functions can't handle. For example, a :sokuon before :chi is romanized as "tchi" and not as "cchi" as it would be by the simple consonant-doubling method. Another exception is that, for example, :chi followed by :+ya is romanized as "cha", not "chya". CLOS makes it easy to handle these irregularities before passing the torch to a less specific method.

(defmethod r-apply ((modifier (eql :sokuon)) (method generic-hepburn) cc-tree)
  (if (eql (car cc-tree) :chi)
      (concatenate 'string "t" (romanize-core method cc-tree))
      (call-next-method)))

(defmethod r-apply ((modifier (eql :+ya)) (method generic-hepburn) cc-tree)
  (case (car cc-tree)
    (:shi "sha")
    (:chi "cha")
    ((:ji :dji) "ja")
    (t (call-next-method))))
    
... and the same for :+yu and :+yo

Another thing Hepburn romanizations do is simplifying double vowels like "oo", "ou" and "uu". For example, our generic-hepburn will romanize "とうきょう" as "toukyou", while most people are more familiar with the spelling "Tokyo" or "Tōkyō".

(defun simplify-ngrams (str map)
  (let* ((alist (loop for (from to) on map by #'cddr collect (cons from to)))
         (scanner (ppcre:create-scanner (cons :alternation (mapcar #'car alist)))))
    (ppcre:regex-replace-all scanner str 
                             (lambda (match &rest rest)
                               (declare (ignore rest))
                               (cdr (assoc match alist :test #'equal)))
                             :simple-calls t)))

(defclass simplified-hepburn (generic-hepburn)
  ((simplifications :initform nil :initarg :simplifications :reader simplifications
                    :documentation "List of simplifications e.g. (\"ou\" \"o\" \"uu\" \"u\")"
                    )))

(defmethod r-simplify ((method simplified-hepburn) str)
  (simplify-ngrams (call-next-method) (simplifications method)))

(defclass traditional-hepburn (simplified-hepburn)
  ((simplifications :initform '("oo" "ō" "ou" "ō" "uu" "ū"))))

I'm using the "parse tree" feature of CL-PPCRE here to create a complex :alternation regex on the fly and then use regex-replace-all with a custom replacing function. It's probably not the most efficient method, but sometimes outsourcing string manipulations to a well-tested regex engine is the least painful solution. Anyway, we're really close now, and all that's left is to chain up our functions for a useful API.

(defparameter *hepburn-traditional* (make-instance 'traditional-hepburn))

(defvar *default-romanization-method* *hepburn-traditional*)

(defun romanize-list (cc-list &key (method *default-romanization-method*))
  "Romanize a character class list according to method"
  (let ((cc-tree (process-modifiers cc-list)))
    (values (r-simplify method (romanize-core method cc-tree)))))

(defun romanize-word (word &key (method *default-romanization-method*))
  "Romanize a word according to method"
  (romanize-list (get-character-classes word) :method method))

>>> (romanize-word "ありがとうございます")
"arigatōgozaimasu"

At my Github you can find an unabridged version of the above code. However there are still some difficult problems with romanization of Japanese that can't be solved as easily. Even leaving kanji aside, the hiragana character は is pronounced either as "ha" or "wa" depending on whether it is used as a particle. For example a common greeting "こんにちは" is romanized as "konnichiwa" and not "konnichiha" because は plays the role of a particle. Which brings us to another problem: there are no spaces between the words, so it's not possible to determine whether は is a part of a word or a standalone particle without a dictionary, and even then it can be ambiguous! I'm ending the post on this note, since I'm still not sure how to solve this. さようなら!

07 Sep 2014 10:00am GMT

05 Sep 2014


Joe Marshall: Another stupid homographic function trick

In my last post I showed that if you take a homographic function and apply it to a fraction, you can partly apply the function to the integer part of the fraction and get a new homographic function. The new function can be applied to the non-integer part of the fraction to generate an answer equivalent to the original function applied to the original fraction.

It turns out that you can go in the other direction as well. You can partly evaluate a homographic function. For example, consider this homographic function:

((lambda (t)
   (/ (+ (* 70 t) 29)
      (+ (* 12 t)  5))) n)

Which we intend to apply to some positive number n. Even if all we know is that n is positive, we can deduce that the value of the homographic function is between 29/5 (when t is 0) and 70/12 (as t goes to infinity). The integer parts of those values are the same, so we can factor that out:

(+ 5 (/ 1 ((lambda (t)
               (/ (+ (* 12 t) 5)
                  (+ (* 10 t) 4))) n)))

The partial answer has an integer value of 5 and a fractional part that contains a new homographic function applied to our original n. We can do it again:

(+ 5 (/ 1
        (+ 1 (/ 1
                ((lambda (t)
                   (/ (+ (* 10 t) 4)
                      (+ (* 2 t) 1))) n)))))

The fractional part of the answer can itself be factored into another integer and a new homographic function applied to our original n.

A generalized continued fraction is a number of the form:

    x = a0 + b1/(a1 + b2/(a2 + b3/(a3 + ...)))

If all the bi are 1, then it is a simple continued fraction. You can turn a generalized continued fraction into a simple continued fraction by doing the algebra.

What happens if you partly apply a homographic function to a continued fraction? The algebra is tedious, but here's what happens:

((lambda (t)
   (/ (+ (* 2 t) 1)
      (+ (* 1 t) 3))) (+ 3 (/ 1 (+ 7 (/ 1 16)))))

;; after one step
((lambda (t)
   (/ (+ (* 7 t) 2)
      (+ (* 6 t) 1))) (+ 7 (/ 1 16)))

;; after two steps
((lambda (t)
   (/ (+ (* 51 t) 7)
      (+ (* 43 t) 6))) 16)

By partly applying a homographic function to a continued fraction, we can process the integer part separately from, and before, the fractional part. By partly evaluating the application of a homographic function, we can often determine the integer part without fully evaluating the argument to the function. For example, after step one above, we could instead partially evaluate the application:

;; after one step
((lambda (t)
   (/ (+ (* 7 t) 2)
      (+ (* 6 t) 1))) (+ 7 (/ 1 16)))

;; Alternatively, partially evaluate first term
(+ 1 (/ 1
       ((lambda (t)
           (/ (+ (* 6 t) 1)
              (+ (* 1 t) 1))) (+ 7 (/ 1 16)))))

05 Sep 2014 3:53pm GMT