22 Nov 2019

Planet Lisp

ABCL Dev: Unleashing the Bear past Java 11

Against the falling canvas of 2019, the Bear has freely thrust ABCL
1.6.0 onto the world stage.

This release fully supports building and running on Java 6, 7, 8, and
11 via the openjdk runtimes. By building from our
source, it is rumored that the User may utilize additional
hosting platforms such as Java 13.

Support for Microsoft Windows platforms has not been thoroughly tested
at this point, but we intend to add it to our automated cloud-based build
in the process of releasing a stabilizing ABCL 1.6.1 build due by the
end of this year.

Thanks to all past, present, and future contributors to the Bear;
here's to hackin' our way out of the prison of the Java
language for over a decade.


22 Nov 2019 2:31pm GMT

Planet Debian

Sylvain Beucler: Android Rebuilds updates

What is it already?

Android Rebuilds provides freely-licensed builds of Android development tools written by somebody else.

New builds

SDK 10 (API 29) and NDK 20 rebuilds are now available, as unattended build scripts as well as binaries you shan't trust.

sdkmanager integration will be complete when we figure out how to give our repo precedence over somebody else's.

Evolution of the situation

SDK build remains monolithic and growing (40GB .git, 7h multi-core build, 200GB build space).

But there are fewer build issues, thanks to newer "prebuilts" deps straight in Git, now including OpenJDK.
I expect we'll soon chroot in Git before build.

Also, for the first time ever, I could complete an NDK Windows build.

Licensing

Official binaries are still click-wrapped with a proprietary license.

It was discovered that such a license is also covering past versions of android.jar & al. hidden in a prebuilts directory and somehow necessary to the builds.
Archeological work already successfully started to rebuild SDKs from the start of the decade.

Fanbase

Android Rebuilds is showcased in ungoogled-chromium-android, a lightweight approach to removing Google web service dependency.

F-Droid mirror

After some back and forth, the F-Droid mirror is stable and limited to the experimental sdkmanager repository.
F-Droid showed high dedication to implementing upload restrictions and establishing procedures.
I have great hope that they will soon show the same level of dedication dropping non-free licenses and freeing their build server.

22 Nov 2019 1:35pm GMT

Planet Python

Jaime Buelta: “Hands-On Docker for Microservices with Python” is now available!

Last year I published a book, and I liked the experience, so I wrote another! The book is called Hands-On Docker for Microservices with Python, and it goes through the different steps to move from a Monolith Architecture towards a Microservices one. It is written from a very practical stand point, and aims to cover …

22 Nov 2019 1:33pm GMT

Ian Ozsvald: Higher Performance Python (ODSC 2019)

Building on PyDataCambridge last week I had the additional pleasure of talking on Higher Performance Python at ODSC 2019 yesterday. I had a brilliant room of 300 Pythonic data scientists at all levels who asked an interesting array of questions:

Happy smiling audience

This talk expanded on last week's version at PyDataCambridge as I had some more time. The problem-intro was a little longer (and this helped set the scene as I had more first-timers in the room), then I dug a little further into Pandas and added extra advice at the end. Overall I covered:

"It was a fantastic talk." - Stewart

My publisher O'Reilly were also kind enough to send over a box of the 1st edition High Performance Python books for signing, just as I did in Cambridge last week. As usual I didn't have enough free books for those hoping for a copy - sorry if you missed out (I only get given a limited set to give away). The new content for the 2nd edition is available online in O'Reilly's Safari Early Access Programme.

Book signing

The talk ends with my customary note requesting a postcard if you learned something useful - feel free to send me an email asking for my address, I love to receive postcards 🙂 I have an email announce list for my upcoming training in January with a plan to introduce a High Performance Python training day, so join that list if you'd like low-volume announcements. I have a twice-a-month email list for "Ian's Thoughts & Jobs Listing" which includes jobs I know about in our UK community and my recommendations and notes. Join this if you'd like an idea of what's happening in the UK Pythonic Data Science scene.

The 2nd edition of High Performance Python should be out for next April, preview it in the Early Access Programme here.


Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.

The post Higher Performance Python (ODSC 2019) appeared first on Entrepreneurial Geekiness.

22 Nov 2019 1:21pm GMT

Codementor: Python, Javascript, and Web automation

Comparing Python and Javascript in the context of web automation.

22 Nov 2019 12:33pm GMT

Planet Grep

Xavier Mertens: [SANS ISC] Abusing Web Filters Misconfiguration for Reconnaissance

I published the following diary on isc.sans.edu: "Abusing Web Filters Misconfiguration for Reconnaissance":

Yesterday, an interesting incident was detected while working at a customer SOC. They use a "next-generation" firewall that implements a web filter based on categories. This is common in many organizations today: Users' web traffic is allowed/denied based on an URL categorization database (like "adult content", "hacking", "gambling", …). How was it detected? [Read more]

[The post [SANS ISC] Abusing Web Filters Misconfiguration for Reconnaissance has been first published on /dev/random]

22 Nov 2019 11:17am GMT

Planet Debian

Joachim Breitner: Faster Winter 5: Eta-Expanding ReaderT

(This is the fifth optimization presented in the "faster winter" series, please see that post for background information.)

Another good approach to performance tuning is to look at the code after GHC has optimized it. So I planted a

{-# OPTIONS_GHC -ddump-simpl -dsuppress-coercions -dsuppress-unfoldings -dsuppress-module-prefixes #-}

at the top of Eval.hs, and looked through the code. This is not Haskell, but rather "GHC Core", a simpler functional programming language that GHC uses internally. It is more verbose and less pretty, so it takes a bit of getting used to, but it's nothing to be afraid of when you are a somewhat experienced Haskell developer.

There I found (much simplified) the following pattern:

step :: Instr -> Config -> IO Result
step e = case e of
  Add -> \c -> do stuff …
  Del -> \c -> do stuff …
  …

That's bad! Yes, Haskell is a functional language, and passing around anonymous functions is very nice to write expressive code, and for most purposes it is not too slow … but in an inner loop, you really don't want any such closures. So where did this come from? And as expected, the Haskell source did not have those inner lambdas. Instead, it was using a very innocent looking monad transformer:

step :: Instr -> ReaderT Config IO Result
step e = case e of
  Add -> do stuff …
  Del -> do stuff …
  …

A ReaderT r m a is just a different way of writing r -> m a that allows us to use do-notation or the monad combinators without having to pass the r around explicitly, and as such it is indeed very convenient. But not as efficient as if we had written

step :: Instr -> Config -> IO Result
step e c = case e of
  Add -> do stuff …
  Del -> do stuff …
  …

where the step function takes two arguments right away, and no anonymous functions are created.

Why doesn't our amazing Haskell compiler figure out that this would be better? Because it is not better in all situations: If we store step e :: ReaderT Config IO Result somewhere and use it many times, with the same e but passing many different c :: Config, then we have to do the case e analysis only once. This can sometimes be better, so the compiler has to leave it in that form, in case we did it intentionally.

(Incidentally, the question of how to allow the compiler to eta-expand more functions seems to eternally haunt me, and its pursuit even led to a PhD thesis.)

So how can we fix it? One relatively crude way is to shove it in the compiler's face that we really want step to be a function with two parameters, by wrapping the whole body in, well, a lambda. But we still want to use the Reader monad in the body of step.

So I came up with this:

step :: Instr -> ReaderT Config IO Result
step e = ReaderT $ \c -> ($ c) $ runReaderT $ case e of
  Add -> do stuff …
  Del -> do stuff …
  …

Now the \c -> is outside the case, the compiler adds it to the arguments of step and we get the code that we want (confirmed by a quick peek at the Core).

Improvement: Allocations: -23.20% Time: -23.00% (Commit f5a0dd2...894070f)

I used this pattern in more than one place, so I wanted to abstract it into a little helper definition. But that's not so easy: If I just write

etaReaderT :: ReaderT r m a -> ReaderT r m a
etaReaderT m = ReaderT $ \c -> ($ c) $ runReaderT m

step :: Instr -> ReaderT Config IO Result
step e = etaReaderT $ case e of
  Add -> do stuff …
  Del -> do stuff …
  …

then the whole thing doesn't work any more! Because now, the case e is again "outside" the \c ->.

I whined on twitter about this, and Sebastian Graf reminded me helpfully of GHC.Exts.oneShot, a little magic function that was added to GHC 5 years ago … by some forgetful person: me.

If we use this in the right place inside etaReaderT it tells GHC in a soothing voice "hey! it's ok! you can move this lambda out of cases. believe me. it's gonna be ok". And with this, it works:

etaReaderT :: ReaderT r m a -> ReaderT r m a
etaReaderT = ReaderT . oneShot . runReaderT

I wonder if this function would make a good addition to Control.Monad.Trans.Reader.

Incidentally, if you look at the code at the end of all my optimizations, there is no mention of etaReaderT any more: Subsequent optimizations simplified the code so much that eventually GHC was able to do this transformation without my help.

22 Nov 2019 9:59am GMT

21 Nov 2019

Planet Debian

Jonathan Dowland: 8-bit

On yesterday's Mary Anne Hobbs radio show, she debuted a new track by Squarepusher, "Vortrack [Fracture Remix]", which you can now watch/listen to on YouTube:

This track really grabbed my attention. Later that day you could buy it in a variety of formats and quality levels on Squarepusher's website.

One of the format/quality options was "8-bit Lossless WAV", which I thought was a joke, a poke in the eye of audiophiles. I was aware that he likely used some 8-bit instruments/equipment to write the tracks, but surely it was mixed in a more modern environment, and resampling down to 8-bit would result in something that sounded like mush.

But it seems the joke's on me; I bought the track and it's seemingly indistinguishable from the audio track on that YouTube video. And it really is 8-bit:

Input #0, wav, from 'Vortrack-001-Squarepusher-Vortrack (Fracture Remix).wav':
  Duration: 00:08:02.99, bitrate: 705 kb/s
    Stream #0:0: Audio: pcm_u8 ([1][0][0][0] / 0x0001),
    44100 Hz, 2 channels, u8, 705 kb/s

It even - losslessly - compressed down to a bitrate lower than a typical MP3:

Input #0, flac, from 'Vortrack-001-Squarepusher-Vortrack (Fracture Remix).flac':
  Duration: 00:08:02.99, start: 0.000000, bitrate: 313 kb/s
    Stream #0:0: Audio: flac, 44100 Hz, stereo, s16 (8 bit)

Sorry, comments on my site seem to be broken at the moment. I'm hoping to fix them soon.

21 Nov 2019 11:55am GMT

Planet Grep

Mattias Geniar: Remove index.php from the URL in Laravel

If you have a Laravel project, you might be surprised to find your routes are probably available on a number of different URLs.

21 Nov 2019 12:00am GMT

Mattias Geniar: Mac OS: xcrun: error: invalid active developer path, missing xcrun

I ran into this error when I tried any git operation (like git clone and git pull) on a Mac.

21 Nov 2019 12:00am GMT

Planet Lisp

Didier Verna: Quickref 3.0 "The Alchemist" is released

Following up with the latest release of Declt, I'm also happy to announce the release of Quickref 3.0 "The Alchemist", our reference manuals aggregator for Quicklisp libraries. The official website has been updated yesterday with both the latest Quicklisp version, and the new Declt, resulting in the documentation of 1792 Common Lisp libraries.

21 Nov 2019 12:00am GMT

20 Nov 2019

Planet Lisp

Vsevolod Dyomkin: Programming Algorithms: Strings

It may not be immediately obvious why the whole chapter is dedicated to strings. Aren't they just glorified arrays? There are several answers to these challenges:

In the base case, a string is, indeed, an array. As we already know, this array may either store its length or be a 0-terminated security catastrophe, like in C (see buffer overflow). So, to iterate, strings should store their length. Netstrings are a notable take on the idea of length-aware strings: it's a simple external format that serializes a string as a tuple of length and contents, separated by a colon and ending with a comma: 3:foo, is the netstring for the string foo.
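
For illustration, encoding and decoding netstrings (for simple single-byte-character strings) takes only a few lines; the function names here are invented for this sketch:

(defun write-netstring (str)
  "Serialize STR as <length>:<contents>,"
  (format nil "~a:~a," (length str) str))

(defun read-netstring (nstr)
  "Extract the contents of a single netstring NSTR."
  (let* ((colon (position #\: nstr))
         (len (parse-integer nstr :end colon)))
    (subseq nstr (1+ colon) (+ colon 1 len))))

;; (write-netstring "foo") => "3:foo,"
;; (read-netstring "3:foo,") => "foo"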

More generally, a string is a sequence of characters. The characters themselves may be single bytes as well as fixed or variable-length byte sequences. The latter kind of character encoding raises a challenging question of what to prefer, correctness or speed? With variable-length Unicode code points, the simplest and fastest string variant, a byte array, breaks, for it will incorrectly report its length (in bytes, not in characters) and fail to retrieve the character by index. Different language ecosystems address this issue differently, and the majority is, unfortunately, broken in one aspect or another.

Overall, there may be two possible solution paths. The first one is to use a fixed-length representation and pad shorter characters to full length. Generally, such a representation will be 32-bit UTF-32, resulting in up to 75% storage space waste for the most common 1-byte ASCII characters. The alternative approach will be to utilize a more advanced data structure. The naive variant is a list, which implies an unacceptable slowdown of the character access operation to O(n). Yet, a balanced approach may combine minimal additional space requirements with acceptable speed.

One of the solutions may be to utilize the classic bitmap trick: use a bit array indicating, for each byte, whether it's the start of a character (only a 12% overhead). Calculating the character position may be performed in a small number of steps with the help of an infamous, in close circles, operation - Population count aka Hamming weight. This hardware instruction calculates the number of 1-bits in an integer and is accessible via the logcount Lisp standard library routine. Behind the scenes, it is also called for bit arrays if you invoke count 1 on them. At least this is the case for SBCL:

CL-USER> (disassemble (lambda (x)
                        (declare (type (simple-array bit) x))
                        (count 1 x)))

; disassembly for (LAMBDA (X))
; Size: 267 bytes. Origin: #x100FC9FD1A
...
; DA2: F3480FB8FA POPCNT RDI, RDX

The indexing function implementation may be quite tricky, but the general idea is to try to jump ahead n characters and calculate the popcount of the substring from the previous position to the current that will tell us the number of characters we have skipped. For the base case of a 1-byte string, we will get exactly where we wanted in just 1 jump and 1 popcount. However, if there were multibyte characters in the string, the first jump would have skipped less than n characters. If the difference is sufficiently small (say, below 10) we can just perform a quick linear scan of the remainder and find the position of the desired character. If it's larger than n/2 we can jump ahead n characters again (this will repeat at most 3 times as the maximum byte-length of a character is 4), and if it's below n/2 we can jump n/2 characters. And if we overshoot we can reverse the direction of the next jump or search. You can see where it's heading: if at each step (or, at least, at each 4th step) we are constantly half dividing our numbers this means O(log n) complexity. That's the worst performance for this function we can get, and it will very efficiently handle the cases when the character length doesn't vary: be it 1 byte - just 2 operations, or 4 bytes - 8 ops.

Here is the prototype of the char-index operation implemented according to the described algorithm (without the implementation of the mb-linear-char-index that performs the final linear scan):

(defstruct (mb-string (:conc-name mbs-))
  bytes
  bitmap)

(defparameter *mb-threshold* 10)

(defun mb-char-index (string i)
  (let ((off 0))
    (loop
      (with ((cnt (count 1 (mbs-bitmap string) :start off :end (+ off i)))
             (diff (- i cnt)))
        (cond
          ((= cnt i) (return (+ off i)))
          ((< diff *mb-threshold*) (return (mb-linear-char-index
                                            string diff off)))
          ((< cnt (floor i 2)) (:+ off i)
                               (:- i cnt))
          (t (:+ off (floor i 2))
             (:- i cnt)))))))

The length of such a string may be calculated by performing the popcount on the whole bitmap:

(defun mb-length (string)
  (count 1 (mbs-bitmap string)))

It's also worth taking into account that there exists a set of rules assembled under the umbrella of the Unicode collation algorithm that specifies how to order strings containing Unicode code-points.

Basic String-Related Optimizations

Strings are often subject to subsequencing, so an efficient implementation may use structure sharing. As we remember, in Lisp, this is accessible via the displaced arrays mechanism (and a convenience RUTILS function slice that we have already used in the code above). Yet, structure sharing should be utilized with care as it opens a possibility for action-at-a-distance bugs if the derived string is modified, which results in parallel modification of the original. Though, strings are rarely modified in-place so, even in its basic form (without mandatory immutability), the approach works well. Moreover, some programming language environments make strings immutable by default. In such cases, to perform on-the-fly string modification (or rather, creation) such patterns as the Java StringBuilder are used, which creates the string from parts by first accumulating them in a list and then, when necessary, concatenating the list's contents into a single final string. An alternative approach is string formatting (the format function in Lisp) that is a higher-level interface, which still needs to utilize some underlying mutation/combination mechanism.
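
For illustration, here is both the convenience and the action-at-a-distance danger of this approach in plain Common Lisp (using make-array's displacement options directly instead of the RUTILS slice wrapper):

(let* ((str (copy-seq "hello world"))
       ;; a "slice" of STR that shares its storage: characters 6..10
       (sub (make-array 5 :element-type 'character
                          :displaced-to str
                          :displaced-index-offset 6)))
  (setf (aref sub 0) #\W)  ; modifying the slice...
  (list sub str))          ; => ("World" "hello World") - ...also modifies the original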

Another important string-related technology is interning. It is a space-saving measure to avoid duplicating the same strings over and over again, which operates by putting a string in a table and using its index afterwards. This approach also enables efficient equality comparison. Interning is performed by the compiler implicitly for all constant strings (in the special segment of the program's memory called "string table"/sstab), and also may be used explicitly. In Lisp, there's a standard function, intern, for this. Lisp symbols use interned strings as their names. Another variant of interning is string pooling. The difference is that interning uses a global string table while the pools may be local.
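
A tiny illustration of why interning makes equality checks cheap:

;; INTERN returns the same symbol object for the same name,
;; so comparison boils down to a single pointer check:
(eq (intern "FOO") (intern "FOO"))       ; => T
;; while two freshly-built strings with the same contents are not EQ:
(eq (copy-seq "FOO") (copy-seq "FOO"))   ; => NIL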

Strings in the Editor

Now, let's consider situations in which representing strings as arrays doesn't work. The primary one is in the editor, i.e. when constant random modification is the norm. There's another not so obvious requirement related to editing: handling potentially arbitrarily long strings that still need to be dynamically modified. Have you tried opening a hundred-megabyte text document in your favorite editor? You'd better not, unless you're a Vim user :) Finally, an additional limitation on handling strings in the editor arises when we allow concurrent modification. This we'll discuss in the chapter on concurrent algorithms.

So, why doesn't an array work well as a string backend in the editor? Because of the content relocation required by all edit operations. O(n) editing is, obviously, not acceptable. What to do? There are several more advanced approaches:

  1. The simplest change will be, once again, to use an array of arrays. For example, for each line. This will not change the general complexity of O(n) but, at least, will reduce n significantly. The issue is that, still, it will depend on the length of the line so, for the not-so-rare degraded case when there are few or no linebreaks, the performance will seriously deteriorate. And, moreover, having observable performance differences between editing different paragraphs of the text is not user-friendly at all.
  2. A more advanced approach would be to use trees, reducing access time to O(log n). There are many different kinds of trees and, in fact, only a few may work as efficient string representations. Among them, a popular data structure for representing strings is the Rope. It's a binary tree where each leaf holds a substring and its length, and each intermediate node further holds the sum of the lengths of all the leaves in its left subtree. It's a more-or-less classic application of binary trees to a storage problem so we won't spend more time on it here. Suffice to say that it has the expected binary-tree performance of O(log n) for all operations, provided that we keep it balanced. It's an ok alternative to a simple array, but, for such a specialized problem, we can do better with a custom solution.
  3. And the custom solution is to return to arrays. There's one clever way to use them that works very well for dynamic strings. It is called a Gap buffer. This structure is an array (buffer) with a gap in the middle. I.e., let's imagine that we have a text of n characters. The Gap buffer will have a length of n + k where k is the gap size - some value, derived from practice, that may fluctuate in the process of string modification. You can recognize this gap as the position of the cursor in the text. The insertion operation in the editor is performed exactly at this place, so it's O(1). Afterwards, the gap will shrink by 1 character, so we'll have to resize the array at some point, if there are too many insertions and the gap shrinks below some minimum size (maybe, below 1). The deletion operation will act exactly the opposite way, growing the gap at one of its sides. The Gap buffer is an approach that is especially suited for normal editing - a process that has its own pace. It also allows the system to represent multiple cursors by maintaining several gaps. Also, it may be a good idea to represent each paragraph as a gap buffer and use an array of them for the whole text. The gap buffer is a special case of the Zipper pattern that we'll discuss in the chapter on functional data structures. (A minimal sketch of such a buffer, in plain Lisp, follows this list.)
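
Since no code for the gap buffer is shown here, the following is a minimal sketch in plain Common Lisp (the structure and function names are invented for this illustration, and resizing of a full gap is omitted for brevity):

;; The gap occupies positions [gap-start, gap-end) of the underlying array;
;; the cursor sits at gap-start.
(defstruct (gap-buffer (:conc-name gb-))
  (chars (make-array 16 :element-type 'character :initial-element #\Space))
  (gap-start 0)
  (gap-end 16))

(defun gb-insert (gb ch)
  "Insert CH at the cursor - O(1)."
  (assert (< (gb-gap-start gb) (gb-gap-end gb)) () "Gap is full - resize needed.")
  (setf (aref (gb-chars gb) (gb-gap-start gb)) ch)
  (incf (gb-gap-start gb))
  gb)

(defun gb-delete (gb)
  "Delete the character before the cursor by growing the gap to the left - O(1)."
  (when (> (gb-gap-start gb) 0)
    (decf (gb-gap-start gb)))
  gb)

(defun gb-move-cursor (gb pos)
  "Move the gap so that the cursor lands at text position POS.
Costs O(distance moved), which stays small during normal editing."
  (loop :while (< pos (gb-gap-start gb)) :do
    (decf (gb-gap-start gb))
    (decf (gb-gap-end gb))
    (setf (aref (gb-chars gb) (gb-gap-end gb))
          (aref (gb-chars gb) (gb-gap-start gb))))
  (loop :while (> pos (gb-gap-start gb)) :do
    (setf (aref (gb-chars gb) (gb-gap-start gb))
          (aref (gb-chars gb) (gb-gap-end gb)))
    (incf (gb-gap-start gb))
    (incf (gb-gap-end gb)))
  gb)

(defun gb->string (gb)
  "The buffer contents with the gap cut out."
  (concatenate 'string
               (subseq (gb-chars gb) 0 (gb-gap-start gb))
               (subseq (gb-chars gb) (gb-gap-end gb))))

;; Example:
;; (let ((gb (make-gap-buffer)))
;;   (map nil (lambda (ch) (gb-insert gb ch)) "hello")
;;   (gb-move-cursor gb 0)
;;   (gb-insert gb #\>)
;;   (gb->string gb))  ; => ">hello"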

Substring Search

One of the most common string operations is substring search. For ordinary sequences we, usually, search for a single element, but strings, on the contrary, more often need subsequence search, which is more complex. A naive approach will start by looking for the first character, then trying to match the next character and the next, until either something ends or there's a mismatch. Unlike with hash-tables, Lisp standard library has good support for string processing, including such operations as search (which, actually, operates on any sequence type) and mismatch that compares two strings from a chosen side and returns the position at which they start to diverge.

If we were to implement our own string-specific search, the most basic version would, probably, look like this:

(defun naive-match (pat str)
  (dotimes (i (- (1+ (length str)) (length pat)))
    ;; MISMATCH returns NIL on a complete match, which we treat as (length pat)
    (when (= (or (mismatch pat (slice str i))
                 (length pat))
             (length pat))
      (return-from naive-match i))))

If the strings had been random, the probability that we are correctly matching each subsequent character would have dropped to 0 very fast. Even if we consider just the English alphabet, the probability of the first character being the same in 2 random strings is 1/26, the first and second - 1/676, and so on. And if we assume that the whole charset may be used, we'll have to substitute 26 with 256 or a greater value. So, in theory, such naive approach has almost O(n) complexity, where n is the length of the string. Yet, the worst case has O(n * m), where m is the length of the pattern. Why? If we try to match a pattern a..ab against a string aa.....ab, at each position, we'll have to check the whole pattern until the last character mismatches. This may seem like an artificial example and, indeed, it rarely occurs. But, still, real-world strings are not so random and are much closer to the uniform corner case than to the random one. So, researchers have come up with a number of ways to improve subsequence matching performance. Those include the four well-known inventor-glorifying substring search algorithms: Knuth-Morris-Pratt, Boyer-Moore, Rabin-Karp, and Aho-Corasick. Let's discuss each one of them and try to determine their interesting properties.

KMP

Knuth-Morris-Pratt is the most basic of these algorithms. Prior to performing the search, it examines the pattern to find repeated subsequences in it and creates a table containing, for each character of the pattern, the length of the prefix of the pattern that can be skipped if we have reached this character and failed the search at it. This table is also called the "failure function". The number in the table is calculated as the length of the proper suffix[1] of the pattern substring ending before the current character that matches the start of the pattern.

I'll repeat here the example provided in Wikipedia that explains the details of the table-building algorithm, as it's somewhat tricky.

Let's build the table for the pattern abcdabd. We set the table entry for the first char a to -1. To find the entry for b, we must discover a proper suffix of a which is also a prefix of the pattern. But there are no proper suffixes of a, so we set this entry to 0. To find the entry with index 2, we see that the substring ab has a proper suffix b. However, b is not a prefix of the pattern. Therefore, we also set this entry to 0.

For the next entry, we first check the proper suffix of length 1, and it fails like in the previous case. Should we also check longer suffixes? No. We can formulate a shortcut rule: at each stage, we need to consider checking suffixes of a given size (1+ n) only if a valid suffix of size n was found at the previous stage and should not bother to check longer lengths. So we set the table entry for c to 0 also.

We pass to the subsequent character a. The same logic shows that the longest substring we need to consider has length 1, and as in the previous case it fails since d is not a prefix. But instead of setting the table entry to 0, we can do better by noting that a is also the first character of the pattern, and also that the corresponding character of the string can't be a (as we're calculating for the mismatch case). Thus there is no point in trying to match the pattern for this character again - we should begin 1 character ahead. This means that we may shift the pattern by match length plus one character, so we set the table entry to -1.

Considering now the next character b: though by inspection the longest substring would appear to be a, we still set the table entry to 0. The reasoning is similar to the previous case. b itself extends the prefix match begun with a, and we can assume that the corresponding character in the string is not b. So backtracking before it is pointless, but that character may still be a, hence we set the entry not to -1, but to 0, which means shifting the pattern by 1 character to the left and trying to match again.

Finally, for the last character d, the rule of the proper suffix matching the prefix applies, so we set the table entry to 2.

The resulting table is:

  a   b   c   d   a   b   d
 -1   0   0   0  -1   0   2

Here's the implementation of the table-building routine:

(defun kmp-table (pat)
  (let ((rez (make-array (length pat)))
        (i 0))  ; prefix length
    (:= (? rez 0) -1)
    (loop :for j :from 1 :below (length pat) :do
      (if (char= (char pat i) (char pat j))
          (:= (? rez j) (? rez i))
          (progn (:= (? rez j) i
                     i (? rez i))
                 (loop :while (and (>= i 0)
                                   (not (char= (char pat i) (char pat j))))
                       :do (:= i (? rez i)))))
      (:+ i))
    rez))
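
As a quick sanity check, applying it to the example pattern above should reproduce the table we just built by hand (?, :=, and :+ are the RUTILS shorthands for element access, assignment, and increment used throughout these snippets):

CL-USER> (kmp-table "abcdabd")
#(-1 0 0 0 -1 0 2)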

It can be proven that it runs in O(m). We won't show it here, so coming up with proper calculations is left as an exercise to the reader.

Now, the question is, how shall we use this table? Let's look at the code:

(defun kmp-match (pat str)
  (let ((s 0)
        (p 0)
        (ff (kmp-table pat)))
    (loop :while (< s (length str)) :do
      (if (char= (char pat p) (char str s))
          ;; if the current characters of the pattern and string match
          (if (= (1+ p) (length pat))
              ;; if we reached the end of the pattern - success
              (return (- s p))
              ;; otherwise, match the subsequent characters
              (:= p (1+ p)
                  s (1+ s)))
          ;; if the characters don't match
          (if (= -1 (? ff p))
              ;; shift the pattern for the whole length
              (:= p 0
                  ;; and skip to the next char in the string
                  s (1+ s))
              ;; try matching the current char again,
              ;; shifting the pattern to align the prefix
              ;; with the already matched part
              (:= p (? ff p)))))))

As we see, the index in the string (s) is incremented at each iteration except when the entry in the table is positive. In the latter case, we may examine the same character more than once but not more than we have advanced in the pattern. And the advancement in the pattern means the same advancement in the string (as the match is required for the advancement). In other words, we can backtrack not more than n times over the whole algorithm runtime, so the worst-case number of operations in kmp-match is 2n, while the best-case is just n. Thus, the total complexity is O(n + m).

And what will happen in our aa..ab example? The failure function for it will look like the following: -1 -1 -1 -1 (- m 2). Once we reach the first mismatch, we'll need to backtrack by 1 character, perform the comparison, which will mismatch, advance by 1 character (to b), mismatch again, again backtrack by 1 character, and so on until the end of the string. So, this case will have almost the abovementioned 2n runtime.

To conclude, the optimization of KMP lies in excluding unnecessary repetition of the same operations by memoizing the results of partial computations - both in table-building and matching parts. The next chapter of the book will be almost exclusively dedicated to studying this approach in algorithm design.

BM

Boyer-Moore algorithm is conceptually similar to KMP, but it matches from the end of the pattern. It also builds a table, or rather three tables, but using a different set of rules, which also involve the characters in the string we search. More precisely, there are two basic rules instead of one for KMP. Besides, there's another rule, called the Galil rule, that is required to ensure the linear complexity of the algorithm. Overall, BM is pretty complex in the implementation details and also requires more preprocessing than KMP, so its utility outweighs these factors only when the search is repeated multiple times for the same pattern.

Overall, BM may be faster with normal text (and the longer the pattern, the faster), while KMP will work the best with strings that have a short alphabet (like DNA). However, I would choose KMP as the default due to its relative simplicity and much better space utilization.

RK

Now, let's talk about alternative approaches that rely on techniques other than pattern preprocessing. They are usually used to find matches of multiple patterns in one go as, for the base case, their performance will be worse than that of the previous algorithms.

Rabin-Karp algorithm uses an idea of the Rolling hash. It is a hash function that can be calculated incrementally. The RK hash is calculated for each substring of the length of the pattern. If we were to calculate a normal hash function like fnv-1, we'd need to use each character for the calculation - resulting in O(n * m) complexity of the whole procedure. The rolling hash is different as it requires, at each step of the algorithm, to perform just 2 operations: as the "sliding window" moves over the string, subtract the part of the hash corresponding to the character that is no longer part of the substring and add the new value for the character that has just become the part of the substring.

Here is the skeleton of the RK algorithm:

(defun rk-match (pat str)
  (let ((len (length pat))
        (phash (rk-hash pat)))
    (loop :for i :from len :to (length str)
          :for beg := (- i len)
          :for shash := (rk-hash (slice str 0 len))
            ;; remove the char that left the window and add the one that entered it
            :then (rk-rehash shash len (char str (1- beg)) (char str (1- i)))
          :when (and (= phash shash)
                     (string= pat (slice str beg i)))
            :collect beg)))

A trivial rk-hash function would be just:

(defun rk-hash (str)
  (loop :for ch :across str :sum (char-code ch)))

But it is, obviously, not a good hash-function as it doesn't ensure the equal distribution of hashes. Still, in this case, we need a reversible hash-function. Usually, such hashes add position information into the mix. An original hash-function for the RK algorithm is the Rabin fingerprint that uses random irreducible polynomials over Galois fields of order 2. The mathematical background needed to explain it is somewhat beyond the scope of this book. However, there are simpler alternatives such as the following:

(defun rk-hash (str)
  (assert (> (length str) 0))
  (let ((rez (char-code (char str 0))))
    (loop :for ch :across (slice str 1) :do
      (:= rez (+ (rem (* rez 256) 101)
                 (char-code ch))))
    (rem rez 101)))

Its basic idea is to treat the partial values of the hash as the coefficients of some polynomial.
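
For instance, tracing this function on the 2-character string "ab" (ASCII codes 97 and 98) gives:

CL-USER> (rk-hash "ab")
;; rez = 97                                 ; (char-code #\a)
;; rez = (+ (rem (* 97 256) 101) 98) = 185  ; 24832 mod 101 = 87, 87 + 98 = 185
;; result: (rem 185 101) = 84
84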

The implementation of rk-rehash for this function will look like this:

(defun rk-rehash (hash len ch1 ch2)
  (rem (+ (* (+ hash 101
                (- (rem (* (char-code ch1)
                           (expt 256 (1- len)))
                        101)))
             256)
          (char-code ch2))
       101))

Our rk-match could be used to find many matches of a single pattern. To adapt it for operating on multiple patterns at once, we'll just need to pre-calculate the hashes for all patterns and lookup the current rk-hash value in this set. Additional optimization of this lookup may be performed with the help of a Bloom filter - a stochastic data structure we'll discuss in more detail later.

Finally, it's worth noting that there are other similar approaches to the rolling hash concept that trade some of the uniqueness properties of the hash function for the ability to produce hashes incrementally or have similar hashes for similar sequences. For instance, the Perceptual hash (phash) is used to find near-match images.

AC

Aho-Corasick is another algorithm that allows matching multiple strings at once. The preprocessing step of the algorithm constructs a Finite-State Machine (FSM) that resembles a trie with additional links between the various internal nodes. The FSM is a graph data structure that encodes possible states of the system and actions needed to transfer it from one state to the other.

The AC FSM is constructed in the following manner:

  1. Build a trie of all the words in the set of search patterns (the search dictionary). This trie represents the possible flows of the program when there's a successful character match at the current position. Add a loop edge for the root node.
  2. Add backlinks transforming the trie into a graph. The backlinks are used when a failed match occurs. These backlinks are pointing either to the root of the trie or if there are some prefixes that correspond to the part of the currently matched path - to the end of the longest prefix. The longest prefix is found using BFS of the trie. This approach is, basically, the same idea used in KMP and BM to avoid reexamining the already matched parts. So backlinks to the previous parts of the same word are also possible.

Here is the example FSM for the search dictionary '("the" "this" "that" "it" "his"):

Basically, it's just a trie with some backlinks to account for already processed prefixes. One more detail missing for this graph to be a complete FSM is an implicit backlink to the root node from all the nodes that don't have an explicit backlink.

The main loop of the algorithm is rather straightforward: examine each character and either follow the trie edge that matches it or, if there's no such edge, follow the backlink and retry from there; whenever the current node marks the end of one of the dictionary words, record a match.
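
To make this concrete, here is a rough sketch of such an FSM node and of the main loop (my own illustration, not code from the book); it assumes the trie, the backlinks, and each node's list of matched words have already been constructed:

(defstruct ac-node
  (children (make-hash-table :test 'eql))  ; char -> ac-node
  fail                                     ; the backlink (nil for the root)
  output)                                  ; dictionary words ending at this node

(defun ac-search (root str)
  "Return a list of (word . start-position) matches of the dictionary in STR."
  (let ((node root)
        (matches (list)))
    (loop :for i :from 0
          :for ch :across str :do
      ;; on a mismatch, follow backlinks until a matching edge is found
      ;; or we are back at the root
      (loop :until (or (gethash ch (ac-node-children node))
                       (eq node root))
            :do (:= node (ac-node-fail node)))
      (:= node (or (gethash ch (ac-node-children node)) root))
      ;; record every dictionary word that ends at this node
      ;; or at one of the nodes reachable via its backlinks
      (loop :for n := node :then (ac-node-fail n) :while n :do
        (dolist (word (ac-node-output n))
          (push (cons word (1+ (- i (length word)))) matches))))
    matches))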

As we see from the description, the complexity of the main loop is linear in the length of the string: at most, 2 matches are performed, for each character. The FSM construction is also linear in the total length of all the words in the search dictionary.

The algorithm is often used in antivirus software to perform an efficient search for code signatures against a database of known viruses. It also formed the basis of the original Unix command fgrep. And, from my point of view, it's the simplest to understand yet pretty powerful and versatile substring search algorithm that may be a default choice if you ever have to implement one yourself.

Regular Expressions

Searching is, probably, the most important advanced string operation. Besides, it is not limited to mere substring search - matching of more complex patterns is even in higher demand. These patterns, which are called "regular expressions" or, simply, regexes, may include optional characters, repetition, alternatives, backreferences, etc. Regexes play an important role in the history of the Unix command-line, being the principal technology of the infamous grep utility, and then the cornerstone of Perl. All modern programming languages support them either in the standard library or, as Lisp, with high-quality third-party addons (cl-ppcre).

One of my favorite programming books, "Beautiful Code", has a chapter on implementing simple regex matching from Brian Kernighan with code written by Rob Pike. It shows how easy it is to perform basic matching of the following patterns:

c    matches any literal character c
.    matches any single character
^    matches the beginning of the input string
$    matches the end of the input string
*    matches zero or more occurrences of the previous character

Below the C code from the book is translated into an equivalent Lisp version:

(defun match (regex text)
  "Search for REGEX anywhere in TEXT."
  (if (starts-with "^" regex)  ; STARTS-WITH is from RUTILS
      (match-here (slice regex 1) text)
      (dotimes (i (length text))
        (when (match-here regex (slice text i))
          (return t)))))

(defun match-here (regex text)
  "Search for REGEX at beginning of TEXT."
  (cond ((= 0 (length regex))
         t)
        ((and (> (length regex) 1)
              (char= #\* (char regex 1)))
         (match-star (char regex 0) (slice regex 2) text))
        ((string= "$" regex)
         (= 0 (length text)))
        ((and (> (length text) 0)
              (member (char regex 0) (list #\. (char text 0)))
              (match-here (slice regex 1) (slice text 1))))))

(defun match-star (c regex text)
  "Search for C*REGEX at beginning of TEXT."
  (loop
    (when (match-here regex text) (return t))
    ;; only consume the next character if it is matched by C
    (unless (and (> (length text) 0)
                 (member c (list #\. (char text 0))))
      (return))
    (:= text (slice text 1))))
This is a greedy linear algorithm. However, modern regexes are much more advanced than this naive version. They include such features as register groups (to record the spans of text that match a particular subpattern), backreferences, non-greedy repetition, and so on and so forth. Implementing those will require changing the simple linear algorithm to a backtracking one. And incorporating all of them would quickly transform the code above into a horrible unmaintainable mess: not even due to the number of cases that have to be supported but due to the need of accounting for the complex interdependencies between them.

And, what's worse, soon there will arise a need to resort to backtracking. Yet, a backtracking approach has a critical performance flaw: potential exponential runtime for certain input patterns. For instance, the Perl regex engine (PCRE) requires over sixty seconds to match a 30-character string aa..a against the pattern a?{15}a{15} (on standard hardware). While the alternative approach, which we'll discuss next, requires just twenty microseconds - a million times faster. And it handles a 100-character string of a similar kind in under 200 microseconds, while Perl would require over 10^15 years.[2]

This issue is quite severe and has even prompted Google to release their own regex library with strict linear performance guarantees - RE2. The goal of the library is not to be faster than all other engines under all circumstances. Although RE2 guarantees linear-time performance, the linear-time constant varies depending on the overhead entailed by its way of handling of the regular expression. In a sense, RE2 behaves pessimistically whereas backtracking engines behave optimistically, so it can be outperformed in various situations. Also, its goal is not to implement all of the features offered by PCRE and other engines. As a matter of principle, RE2 does not support constructs for which only backtracking solutions are known to exist. Thus, backreferences and look-around assertions are not supported.

The figures above are taken from a seminal article by Russ Cox. He goes on to add:

Historically, regular expressions are one of computer science's shining examples of how using good theory leads to good programs. They were originally developed by theorists as a simple computational model, but Ken Thompson introduced them to programmers in his implementation of the text editor QED for CTSS. Dennis Ritchie followed suit in his own implementation of QED, for GE-TSS. Thompson and Ritchie would go on to create Unix, and they brought regular expressions with them. By the late 1970s, regular expressions were a key feature of the Unix landscape, in tools such as ed, sed, grep, egrep, awk, and lex. Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools.

The linear-time approach to regex matching relies on a technique similar to the one in the Aho-Corasick algorithm - the FSM. Actually, if by regular expressions we mean a set of languages that abide by the rules of the regular grammars in the Chomsky hierarchy of languages, the FSM is their exact theoretical computation model. Here is how an FSM for a simple regex a*b$ might look:

Such an FSM is called an NFA (Nondeterministic Finite Automaton) as some states have more than one alternative successor. Another type of automaton is the DFA (Deterministic Finite Automaton), which permits transitions to at most one state from each state. The method to transform the regex into an NFA is called Thompson's construction. And an NFA can be made into a DFA by the Powerset construction and then be minimized to get an optimal automaton. DFAs are more efficient to execute than NFAs, because DFAs are only ever in one state at a time: they never have a choice of multiple next states. But the construction takes additional time. Anyway, both NFAs and DFAs guarantee linear-time execution.

Thompson's algorithm builds the NFA up from partial NFAs for each subexpression, with a different construction for each operator. The partial NFAs have no matching states: instead, they have one or more dangling arrows, pointing to nothing. The construction process will finish by connecting these arrows to a matching state.

Counting the states in the above constructions, we can see that this technique creates exactly one state per character or metacharacter in the regular expression. The only exception is the constructs c{n} or c{n,m}, which require duplicating the single-character automaton n or m times respectively, but it is still a constant number. Therefore, the number of states in the final NFA is at most equal to the length of the original regular expression plus some constant.

Implementation of the Thompson's Construction

The core of the algorithm could be implemented very transparently with the help of the Lisp generic functions. However, to enable their application, we'd first need to transform the raw expression into a sexp (tree-based) form. Such representation is supported, for example, in the cl-ppcre library:

PPCRE> (parse-string "ab[0-9]+c$")
(:SEQUENCE "ab" (:GREEDY-REPETITION 1 NIL (:RANGE #\0 #\9)) #\c :END-ANCHOR)

Parsing is a whole separate topic that will be discussed next. But once we have performed it, we gain the possibility to straightforwardly implement Thompson's construction by traversing the parse tree and emitting, for each state, the corresponding part of the automaton. The Lisp generic functions are a great tool for implementing such a transformation, as they allow us to define methods that are selected based on either the type or the identity of the arguments. And those methods can be added independently, so the implementation is clear and extensible. We will define 2 generic functions: one to emit the automaton fragment (th-part) and another to help in transition selection (th-match).

First, let's define the state node of the FSM. We will use a linked graph representation for the automaton. So, a variable for the FSM in the code will point to its start node, and it will, in turn, reference the other nodes. There will also be a special node that will be responsible for recording the matches (*matched-state*).

(defstruct th-state
  transitions)

(defparameter *initial-state* nil)
(defparameter *matched-state* (make-th-state))

(defun th-state (&rest transitions)
  "A small convenience function to construct TH-STATE structs."
  (make-th-state :transitions (loop :for (cond state) :on transitions :by 'cddr
                                    :collect (pair cond state))))

And now, we can define the generic function that will emit the nodes:

(define-condition check-start-anchor () ())

(defgeneric th-part (next-state kind &rest args)
  (:documentation
   "Emit the TH-STATE structure of a certain KIND
    (which may be a keyword or a raw string) using the other ARGS
    and pointing to NEXT-STATE struct.")
  (:method (next-state (kind (eql :sequence)) &rest args)
    (apply 'th-part (if (rest args)
                        (apply 'th-part next-state :sequence (rest args))
                        next-state)
           (first args)))
  (:method (next-state (kind (eql :greedy-repetition)) &rest args)
    ;; this method can handle *, +, {n}, and {n,m} regex modifiers
    ;; in any case, there's a prefix sequence of fixed nonnegative length
    ;; of identical elements that should unconditionally match,
    ;; followed by a bounded or unbounded sequence that,
    ;; in case of a failed match, transitions to the next state
    (apply 'th-part
           (let ((*initial-state* next-state))
             (apply 'th-part next-state :sequence
                    (loop :repeat (or (second args) 1)
                          :collect (mklist (third args)))))
           :sequence (loop :repeat (first args)
                           :collect (mklist (third args)))))
  (:method (next-state (kind character) &rest args)
    (th-state kind next-state
              ;; Usually, *initial-state* will be nil, i.e. further computations
              ;; along this path will be aborted, but for some variants (? or *)
              ;; they will just continue normally to the next state.
              ;; The special variable allows us to control this, as you can see in
              ;; the method for :greedy-repetition
              t *initial-state*))
  (:method (next-state (kind (eql :end-anchor)) &rest args)
    (th-state nil *matched-state*
              t *initial-state*))
  (:method (next-state (kind (eql :start-anchor)) &rest args)
    ;; This part is unique in that all the other parts consume the next character
    ;; (we're not implementing lookahead here), but this one shouldn't.
    ;; To implement such behavior without the additional complexity created by passing
    ;; the string being searched to this function (which we'll still, probably, need to do
    ;; later on, but were able to avoid so far), we can resort to a cool Lisp technique
    ;; of signaling a condition that can be handled specially in the top-level code
    (signal 'check-start-anchor)
    next-state))

Here, we have defined some of the methods of th-part that specialize for the basic :sequence of expressions, :greedy-repetition (regex * and +), a single character, and the single symbols :start-anchor/:end-anchor (regexes ^ and $). As you can see, some of them dispatch (are chosen based) on the identity of the kind argument (using eql specializers), while the character-related method specializes on the class of the arg. As we develop this facility, we could add more methods with defmethod. Running th-part on the whole parse-tree will produce the complete automaton; we don't need to do anything else!

To use the constructed FSM, we run it with the string as input. NFAs are endowed with the ability to guess perfectly when faced with a choice of next state: to run the NFA on a real computer, we must find a way to simulate this guessing. One way to do that is to guess one option, and if that doesn't work, try the other. A more efficient way to simulate perfect guessing is to follow all admissible paths simultaneously. In this approach, the simulation allows the machine to be in multiple states at once. To process each letter, it advances all the states along all the arrows that match the letter. In the worst case, the NFA might be in every state at each step, but this results in at worst a constant amount of work independent of the length of the string, so arbitrarily large input strings can be processed in linear time. The efficiency comes from tracking the set of reachable states but not which paths were used to reach them. In an NFA with n nodes, there can only be n reachable states at any step.

(defun run-nfa (nfa str)
  (let ((i 0)
        (start 0)
        (matches (list))
        (states (list nfa)))
    ;; this is the counterpart for the start-anchor signal
    (handler-bind ((check-start-anchor
                     ;; there's no sense to proceed matching a ^... regex
                     ;; if the string is not at its start
                     (lambda (c) (when (> i 0) (return-from run-nfa)))))
      (dovec (char (concatenate 'vector str
                                #(nil)))  ; for handling end-anchor
        (let ((new-states (list)))
          (dolist (state states)
            (dolist (tr (? state 'transitions))
              (when (th-match tr char)
                (case (rt tr)
                  (*matched-state* (push start matches))
                  (nil )  ; ignore it
                  (t (pushnew (rt tr) new-states)))
                (return))))
          (if new-states
              (:= states new-states)
              (:= states (list nfa)
                  start nil)))
        (:+ i)
        (unless start (:= start i))))
    matches))

The th-match function may have methods to match a single char and a character range, as well as a particular predicate. Its implementation is trivial and left as an exercise to the reader.
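
For completeness, here is a minimal sketch of what it could look like, assuming the transitions are the (condition state) pairs built by th-state above (lt being the RUTILS accessor for the first element of a pair, the counterpart of the rt used in run-nfa); ranges and predicates are left out:

(defgeneric th-match (transition char)
  (:documentation "Decide whether TRANSITION fires for CHAR.")
  (:method (transition (char character))
    ;; a condition of T matches any character, a character condition
    ;; matches only that character
    (let ((condition (lt transition)))
      (or (eql condition t)
          (eql condition char))))
  (:method (transition (char null))
    ;; only the end-anchor transition (condition NIL) accepts
    ;; the NIL end-of-string sentinel appended in RUN-NFA
    (null (lt transition))))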

Overall, interpreting an automaton is a simple and robust approach, yet if we want to squeeze all the possible performance, we can compile it directly to machine code. This is much easier to do with the DFA as it has at most 2 possible transitions from each state, so the automaton can be compiled to a multi-level conditional and even a jump-table.

Grammars

Regexes are called "regular" for a reason: there's a corresponding mathematical formalism "regular languages" that originates from the hierarchy of grammars compiled by Noam Chomsky. This hierarchy has 4 levels, each one allowing strictly more complex languages to be expressed with it. And for each level, there's an equivalent computation model:

  1. regular languages - finite-state machines
  2. context-free languages - pushdown automata
  3. context-sensitive languages - linear-bounded automata
  4. recursively enumerable languages - Turing machines

We have already discussed the bottom layer of the hierarchy. Regular languages are the most limited (and thus the simplest to implement): for example, you can write a regex a{15}b{15}, but you won't be able to express a{n}b{n} for an arbitrary n, i.e. ensure that b is repeated the same number of times as a. The top layer corresponds to all programs and so all the programming science and lore, in general, is applicable to it. Now, let's talk about context-free grammars which are another type that is heavily used in practice and even has a dedicated set of algorithms. Such grammars can be used not only for simple matching but also for parsing and generation. Parsing, as we have seen above, is the process of transforming a text that is assumed to follow the rules of a certain grammar into the structured form that corresponds to the particular rules that can be applied to this text. And generation is the reverse process: applying the rules, obtain the text. This topic is huge and there's a lot of literature on it including the famous Dragon Book.

Parsing is used for processing both artificial (including programming) and natural languages. And, although different sets of rules may be used, as well as different approaches for selecting a particular rule, the resulting structure will be a tree. In fact, formally, each grammar consists of 4 items: the set of terminal symbols (for strings, the characters or tokens themselves), the set of nonterminal symbols (the syntactic categories), the set of production rules stating how a nonterminal may be expanded into a sequence of terminals and nonterminals, and the start symbol from which every derivation begins.

Each compiler has to use parsing as a step in transforming the source into executable code. Also, parsing may be applied for any data format (for instance, JSON) to transform it into machine data. In natural language processing, parsing is used to build the various tree representations of the sentence, which encode linguistic rules and structure.

There are many different types of parsers that differ in the additional constraints they impose on the structure of the production rules of the grammar. The generic context-free constraint is that, in each production rule, the left-hand side may only be a single nonterminal. The most widespread classes of context-free grammars are LL(k) (in particular, LL(1)) and LR (LR(1), SLR, LALR, GLR, etc). For example, an LL(1) parser (one of the easiest to build) parses the input from left to right, performing leftmost derivation of the sentence, and is allowed to look ahead at most 1 token. Not all combinations of derivation rules allow the algorithm to build a parser that will be able to perform unambiguous rule selection under such constraints. But, as LL(1) parsing is simple and efficient, some authors of grammars specifically target their language to be LL(1)-parseable. For example, Pascal and other programming languages created by Niklaus Wirth fall into this category.

There are also two principal approaches to implementing the parser: a top-down and a bottom-up one. In a top-down approach, the parser tries to build the tree from the root, while, in a bottom-up one, it tries to find the rules that apply to groups of terminal symbols and then combine those until the root symbol is reached. Obviously, we can't enumerate all parsing algorithms here, so we'll study only a single approach, which is one of the most wide-spread, efficient, and flexible ones - Shift-Reduce Parsing. It's a bottom-up linear algorithm that can be considered one of the instances of the pushdown automaton approach - a theoretical computational model for context-free grammars.

A shift-reduce parser operates on a queue of tokens of the original sentence. It also has access to a stack. At each step, the algorithm can perform one of two operations: shift - take the next token from the queue and push it onto the stack - or reduce - pop from the stack the items that match the right-hand side of one of the grammar rules and push that rule's left-hand side nonterminal in their place.

Thus, for each token, it will perform exactly 2 "movement" operations: push it onto the stack and pop from the stack. Plus, it will perform rule lookup, which requires a constant number of operations (maximum length of the rhs of any rule) if an efficient structure is used for storing the rules. A hash-table indexed by the rhs's or a trie are good choices for that.

Here's a small example from the domain of NLP syntactic parsing. Let's consider a toy grammar:

S -> NP VP .
NP -> DET ADJ NOUN
NP -> PRP$ NOUN ; PRP$ is a possessive pronoun
VP -> VERB VP
VP -> VERB NP

and the following vocabulary:

DET -> a|an|the
NOUN -> elephant|pyjamas
ADJ -> large|small
VERB -> is|wearing
PRP$ -> my

Now, let's parse the sentence (already tokenized): A large elephant is wearing my pyjamas . First, we'll need to perform part-of-speech tagging, which, in this example, is a matter of looking up the appropriate nonterminals from the vocabulary grammar. This will result in the following:

DET   ADJ    NOUN    VERB   VERB   PRP$   NOUN   .
 |     |      |       |      |      |      |     |
 A   large elephant  is   wearing  my   pyjamas  .

These POS tags will serve the role of terminals for our parsing grammar. Now, the shift-reduce process itself begins:

1. Initial queue: (DET ADJ NOUN VERB VERB PRP$ NOUN .)
Initial stack: ()
Operation: shift


2. Queue: (ADJ NOUN VERB VERB PRP$ NOUN .)
Stack: (DET)
Operation: shift (as there are no rules with the rhs DET)

3. Queue: (NOUN VERB VERB PRP$ NOUN .)
Stack: (ADJ DET)
Operation: shift

4. Queue: (VERB VERB PRP$ NOUN .)
Stack: (NOUN ADJ DET)
Operation: reduce (rule NP -> DET ADJ NOUN)
; we match the rules in reverse to the stack

5. Queue: (VERB VERB PRP$ NOUN .)
Stack: (NP)
Operation: shift

6. Queue: (VERB PRP$ NOUN .)
Stack: (VERB NP)
Operation: shift

7. Queue: (PRP$ NOUN .)
Stack: (VERB VERB NP)
Operation: shift

8. Queue: (NOUN .)
Stack: (PRP$ VERB VERB NP)
Operation: shift

9. Queue: (.)
Stack: (NOUN PRP$ VERB VERB NP)
Operation: reduce (rule: NP -> PRP$ NOUN)

10. Queue: (.)
Stack: (NP VERB VERB NP)
Operation: reduce (rule: VP -> VERB NP)

11. Queue: (.)
Stack: (VP VERB NP)
Operation: reduce (rule: VP -> VERB VP)

12. Queue: (.)
Stack: (VP NP)
Operation: shift

13. Queue: ()
Stack: (. VP NP)
Operation: reduce (rule: S -> NP VP .)

14. Reached the root symbol - end.

The resulting parse tree is:

            __________S______________
           /             \           \
          /               __VP__      \
         /              /      \       \
        /              /     __VP_      \
       /              /     /     \      \
  ___NP_____         /     /      _NP_    \
  /   |    \        /     /      /    \    \
DET  ADJ   NOUN   VERB  VERB   PRP$  NOUN   .
 |    |      |     |     |      |     |     |
 A  large elephant is wearing   my pyjamas  .

The implementation of the basic algorithm is very simple:

(defstruct grammar
  rules
  max-length)

(defmacro grammar (&rest rules)
  `(make-grammar
    :rules (pairs->ht (mapcar (lambda (rule)
                                (pair (nthcdr 2 rule) (first rule)))
                              ',rules))
    :max-length
    (let ((max 0))
      (dolist (rule ',rules)
        (when (> #1=(length (nthcdr 2 rule)) max)
          (:= max #1#)))
      max)))  ; #1= and #1# are reader labels that let us reuse
              ; the same expression without writing it out twice

(defun parse (grammar queue)
  (let ((stack (list)))
    (loop :while queue
          :do (print stack)  ; diagnostic output
              (if-it (find-rule stack grammar)
                     ;; reduce
                     (dotimes (i (length (cdr it))
                                 (push it stack))
                       (pop stack))
                     ;; shift
                     (push (pop queue) stack))
          :finally (return (find-rule stack grammar)))))

(defun find-rule (stack grammar)
  (let (prefix)
    (loop :for item :in stack
          :repeat (? grammar 'max-length)
          :do (push (car (mklist item)) prefix)
              (when-it (? grammar 'rules prefix)
                ;; otherwise, parsing will fail with a stack
                ;; containing a number of partial subtrees
                (return (cons it (reverse (subseq stack 0 (length prefix)))))))))

CL-USER> (parse (print (grammar (S -> NP VP |.|)
                                (NP -> DET ADJ NOUN)
                                (NP -> PRP$ NOUN)
                                (VP -> VERB VP)
                                (VP -> VERB NP)))
                '(DET ADJ NOUN VERB VERB PRP$ NOUN |.|))
#S(GRAMMAR
   :RULES #{
            '(NP VP |.|) S
            '(DET ADJ NOUN) NP
            '(PRP$ NOUN) NP
            '(VERB VP) VP
            '(VERB NP) VP
           }
   :MAX-LENGTH 3)
NIL
(DET)
(ADJ DET)
(NOUN ADJ DET)
((NP DET ADJ NOUN))
(VERB (NP DET ADJ NOUN))
(VERB VERB (NP DET ADJ NOUN))
(PRP$ VERB VERB (NP DET ADJ NOUN))
(NOUN PRP$ VERB VERB (NP DET ADJ NOUN))
((NP PRP$ NOUN) VERB VERB (NP DET ADJ NOUN))
((VP VERB (NP PRP$ NOUN)) VERB (NP DET ADJ NOUN))
((VP VERB (VP VERB (NP PRP$ NOUN))) (NP DET ADJ NOUN))
(S (NP DET ADJ NOUN) (VP VERB (VP VERB (NP PRP$ NOUN))) |.|)

However, an additional level of complexity arises when the grammar becomes ambiguous, i.e. when there may be situations in which several rules apply. Shift-reduce is a greedy algorithm, so, in its basic form, it will select some rule (for instance, the one with the shortest rhs or just the first match), and it cannot backtrack. This may result in a parsing failure. If some form of rule weights is added, the greedy selection may produce a suboptimal parse. Either way, there's no option of backtracking to correct a parsing error. The peculiarity of applying shift-reduce parsing in the NLP domain is that the number of rules is quite significant (it can reach thousands) and there's certainly ambiguity. In this setting, shift-reduce parsing is paired with machine learning techniques that perform a "soft" selection of the action to take at each step, since a reduce is applicable almost always, and so a naive greedy technique becomes pointless.
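
To illustrate what such a "soft" selection might look like, here is a tiny sketch (not from the original text) in which the hardcoded reduce-if-possible policy is replaced by a scoring function. score-action is a made-up stub standing in for a trained classifier over features of the stack and queue:

(defun score-action (action stack queue)
  "Dummy scorer: a real parser would call an ML model here."
  (declare (ignore stack queue))
  (case action
    (:reduce 1.0)
    (:shift 0.5)))

(defun select-action (stack queue grammar)
  "Choose between :shift and an applicable :reduce by comparing scores."
  (let ((rule (find-rule stack grammar)))
    (if (and rule
             (or (null queue)
                 (> (score-action :reduce stack queue)
                    (score-action :shift stack queue))))
        (values :reduce rule)
        (values :shift nil))))

Plugging select-action into the parse loop instead of the bare if-it call would let the model decide, for instance, to keep shifting even when a reduce is technically applicable.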

Actually, shift-reduce would be better called something like stack-queue parsing, as different parsers need not limit themselves to just the shift and reduce operations. For example, an NLP parser that allows the construction of non-projective trees (those where the arcs may cross, i.e. adjacent words don't always belong to the same or adjacent upper-level constituents) adds a swap operation. A more advanced NLP parser that produces a graph structure called an AMR (abstract meaning representation) uses 9 different operations.

Shift-reduce parsing is implemented in many of the parser generator tools, which generate a parser program from a set of production rules. For instance, the popular Unix tool yacc is an LALR parser generator that uses shift-reduce. Another popular tool, ANTLR, is a parser generator for LL(k) languages that uses a direct pushdown automaton-based implementation instead of shift-reduce.

Besides shift-reduce and similar automata-based parsers, many other parsing techniques are used in practice. For example, CYK probabilistic parsing was popular in NLP for some time, but it's an O(n^3) algorithm, so it gradually fell from grace and lost to machine-learning enhanced shift-reduce variants. Another approach is packrat parsing (based on PEG - parsing expression grammars), which has a great Lisp parser-generator library, esrap. Packrat is a more powerful top-down parsing approach with backtracking and unlimited lookahead that nevertheless guarantees linear parse time. Any language defined by an LL(k) or LR(k) grammar can be recognized by a packrat parser, in addition to many languages that conventional linear-time algorithms do not support. This additional power simplifies the handling of common syntactic idioms such as the widespread but troublesome longest-match rule, enables the use of sophisticated disambiguation strategies such as syntactic and semantic predicates, provides better grammar composition properties, and allows lexical analysis to be integrated seamlessly into parsing. The last feature makes packrat very appealing to programmers, as they don't have to use separate tools for lexical analysis (tokenization and token categorization) and parsing. Moreover, the rules for tokens use the same syntax, which is also quite similar to regular expression syntax. For example, here's a portion of the esrap rules for parsing tables in Markdown documents. A Markdown table may look something like this:

| Left-Aligned  | Center Aligned  | Right Aligned |
| :------------ |:---------------:| -------------:|
| col 3 is      | some wordy text |         $1600 |
| col 2 is      | centered        |           $12 |
| zebra stripes | are neat        |            $1 |

You can see that the code is quite self-explanatory: each defrule form consists of a rule name (lhs), its rhs, and a transformation of the rhs into a data structure. For instance, in the rule table-row the rhs is (and (& #\|) (+ table-cell) #\| sp newline): the row should start with a | char, be followed by 1 or more table-cells (a separate rule), and end with a |, some space characters, and a newline. And the transformation (:destructure (_ cells &rest __) ...) only cares about the content, i.e. the table cells.

(defrule sp (* space-char)
  (:text t))

(defrule table-cell (and #\|
                         sp
                         (* (and (! (or (and sp #\|) endline)) inline))
                         sp
                         (& #\|))
  (:destructure (_ __ content &rest ___)
    (mapcar 'second content)))

(defrule table-row (and (& #\|) (+ table-cell) #\| sp newline)
  (:destructure (_ cells &rest __)
    (mapcar (lambda (a) (cons :plain a))
            cells)))

(defrule table-align-cell (and sp (? #\:) (+ #\-) (? #\:) sp #\|)
  (:destructure (_ left __ right &rest ___)
    (if right (if left 'center 'right) (when left 'left))))

(defrule table-align-row (and #\| (+ table-align-cell) sp newline)
  (:destructure (_ aligns &rest __)
    aligns))

(defrule table-head (and table-row table-align-row))
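
Assuming the supporting rules that these definitions rely on (space-char, inline, endline, newline) are defined as in the full grammar, a call to the generated parser might look something like the following; the exact return value depends on those rules, so treat it as an illustration rather than verified output:

;; hypothetical REPL interaction - requires the remaining rules
;; of the Markdown grammar to be loaded
CL-USER> (esrap:parse 'table-align-row
                      (format nil "| :--- | ---: |~%"))
(LEFT RIGHT)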

To conclude the topic of parsing, I wanted to pose a question: can it be used to match regular expressions? And the answer, of course, is that it can, as we are operating in a more powerful paradigm that includes regexes as a subdomain. However, the critical showstopper to applying parsing to this problem is the need to define a grammar instead of writing a compact and more or less intuitive regex...

String Search in Action: Plagiarism Detection

Plagiarism detection is a very challenging problem that doesn't have an exact solution. The reason is that there's no exact definition of what can be considered plagiarism and what can't; the boundary is rather blurry. Obviously, if a text or some part of it is just copy-pasted, everything is clear. But, usually (and especially when they know that plagiarism detection is at play), people will apply their creativity to alter the text in slight or even significant ways. However, over the years, researchers have come up with numerous plagiarism detection algorithms of quality good enough to be used in educational institutions. The problem is very popular, and there are even shared-task challenges dedicated to improving plagiarism catchers. It's somewhat of an arms race between the plagiarists and the detection systems.

One of the earliest, but still quite effective, ways of implementing plagiarism detection is the Shingle algorithm. It is also based on the idea of using hashes and some basic statistical sampling techniques. The algorithm operates in the following stages:

  1. Text normalization (this may include case normalization, reduction of the words to basic forms, error correction, cleanup of punctuation, stopwords, etc.)
  2. Selection of the shingles and calculation of their hashes.
  3. Sampling the shingles from the text in question.
  4. Comparison of the hashes of the original shingles to the sampled hashes and evaluation.

A single shingle is a contiguous sequence of words from the normalized text (another name for this object, in NLP, is an ngram). For 2-word shingles, the original text will give us (1- n) of them, where n is the number of words (and, in general, (- n (1- k)) shingles of length k). The hashes of the shingles are ordinary string hashes (like fnv-1).

The text which is analyzed for plagiarism is also split into shingles, but not all of them are used - just a random sample of m of them. Sampling theory can give a good estimate of the sample size that can be trusted with a high degree of confidence. For efficient comparison, all the original hashes can be stored in a hash-set. If the number of overlapping shingles exceeds some threshold, the text can be considered plagiarised. Alternatively, the algorithm may return the plagiarism degree: the percentage of overlapping shingles. The complexity of the algorithm is O(n + m).
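
Here is a minimal sketch of the described approach (not the book's implementation): it uses 2-word shingles, sxhash as a stand-in for a proper string hash such as fnv-1, and naive sampling with replacement; all of the names are made up for this illustration.

(defun shingle-hashes (words &key (k 2))
  "Return the hashes of all K-word shingles of WORDS (a list of words)."
  (loop :for tail :on words
        :while (nthcdr (1- k) tail)
        :collect (sxhash (format nil "~{~A ~}" (subseq tail 0 k)))))

(defun plagiarism-degree (original suspect &key (sample-size 100))
  "Estimate the share of SUSPECT's shingles that also occur in ORIGINAL.
Both arguments are lists of normalized words."
  (let ((index (make-hash-table)))
    ;; index all of the original shingle hashes
    (dolist (h (shingle-hashes original))
      (setf (gethash h index) t))
    ;; check a random sample of the suspect's shingle hashes
    (let* ((hashes (coerce (shingle-hashes suspect) 'vector))
           (m (min sample-size (length hashes)))
           (matches 0))
      (loop :repeat m :do
        (when (gethash (aref hashes (random (length hashes))) index)
          (incf matches)))
      (if (zerop m) 0 (/ matches m)))))

CL-USER> (plagiarism-degree '(the cat sat on the mat)
                            '(the cat sat on a rug))
;; => a ratio around 3/5, as 3 of the 5 suspect shingles occur in the original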

In a sense, the Shingle algorithm may be viewed as an instance of massive string search, where the outcome we're interested in is not so much the positions of the patterns in the text (although, those may also be used to indicate the parts of the text that are plagiarism-suspicious) as the fact that they are present in it.

Take-aways

Strings are peculiar objects: initially, it may seem that they are just arrays. But, beyond this simple understanding, due to their main usage patterns, a much more interesting picture emerges. Advanced string representations and algorithms are examples of special-purpose optimization applied to general-purpose data structures. This is another reason why strings are presented at the end of the part on derived data structures: string algorithms make heavy use of the material we have covered previously, such as trees and graphs.

We have also discussed FSMs - a powerful computational model that can be used to reliably implement complex workflows. FSMs may be used not only for string matching but also for implementing protocol handling (for example, in an HTTP server), complex user interactions, and so on. The Erlang programming language even has a standard library behavior gen_fsm (replaced by the newer gen_statem) that is a framework for easy implementation of FSMs - as many Erlang applications are mass service systems that have state machine-like operation.

P.S. Originally, I expected this chapter to be one of the smallest in the book, but it turned out to be the longest one. Strings are not so simple as they might seem... ;)


[1] A proper suffix is a suffix that is at least one character shorter than the string itself. For example, in the string abc, the proper suffixes are bc and c.

[2] Perl is only the most conspicuous example of a large number of popular programs that use the same algorithm; the same applies to Python, or PHP, or Ruby, or many other languages.

20 Nov 2019 10:47am GMT

26 Apr 2019

feedPlanet Sun

First Image of a Black Hole – Event Horizon

The Event Horizon Telescope (EHT) - a planet-scale array of 8 ground-based radio telescopes and part of an international collaboration - captured the first image of a black hole. On April 10th 2019, EHT researchers disclosed the first direct visual evidence of a supermassive black hole in the heart of the Galaxy Messier 87.

26 Apr 2019 2:32am GMT

04 Nov 2018

feedPlanet Sun

5 Budget-Friendly Telescopes You Can Choose For Viewing Planets

Socrates couldn't have been more right when he said: "I know one thing, that I know nothing." Even with all of the advancements we, as a species, have made in this world, it's still nothing compared to countless of wonders waiting to be discovered in the vast universe. If you've recently developed an interest in ... Read more

04 Nov 2018 1:27pm GMT

20 May 2012

feedPlanet Sun

Annular Solar Eclipse on Sunday, May 20th 2012

On Sunday, May 20th 2012, people in a narrow strip from Japan to the western United States will be able to see an annular solar eclipse, the first in 18 years. The moon will cover as much as 94% of the sun. An Annular Solar Eclipse is different from a Total Solar Eclipse, when the ... Read more

20 May 2012 9:51pm GMT

10 Nov 2011

feedPlanetJava

OSDir.com - Java: Oracle Introduces New Java Specification Requests to Evolve Java Community Process

From the Yet Another dept.:

To further its commitment to the Java Community Process (JCP), Oracle has submitted the first of two Java Specification Requests (JSRs) to update and revitalize the JCP.

10 Nov 2011 6:01am GMT

OSDir.com - Java: No copied Java code or weapons of mass destruction found in Android

From the Fact Checking dept.:

ZDNET: Sometimes the sheer wrongness of what is posted on the web leaves us speechless. Especially when it's picked up and repeated as gospel by otherwise reputable sites like Engadget. "Google copied Oracle's Java code, pasted in a new license, and shipped it," they reported this morning.



Sorry, but that just isn't true.

10 Nov 2011 6:01am GMT

OSDir.com - Java: Java SE 7 Released

From the Grande dept.:

Oracle today announced the availability of Java Platform, Standard Edition 7 (Java SE 7), the first release of the Java platform under Oracle stewardship.

10 Nov 2011 6:01am GMT

08 Nov 2011

feedfosdem - Google Blog Search

papupapu39 (papupapu39)'s status on Tuesday, 08-Nov-11 00:28 ...

papupapu39 · http://identi.ca/url/56409795 #fosdem #freeknowledge #usamabinladen · about a day ago from web.

08 Nov 2011 12:28am GMT

05 Nov 2011

feedfosdem - Google Blog Search

Write and Submit your first Linux kernel Patch | HowLinux.Tk ...

FOSDEM (Free and Open Source Development European Meeting) is a European event centered around Free and Open Source software development. It is aimed at developers and all interested in the Free and Open Source news in the world. ...

05 Nov 2011 1:19am GMT

03 Nov 2011

feedfosdem - Google Blog Search

Silicon Valley Linux Users Group – Kernel Walkthrough | Digital Tux

FOSDEM (Free and Open Source Development European Meeting) is a European event centered around Free and Open Source software development. It is aimed at developers and all interested in the Free and Open Source news in the ...

03 Nov 2011 3:45pm GMT

28 Oct 2011

feedPlanet Ruby

O'Reilly Ruby: MacRuby: The Definitive Guide

Ruby and Cocoa on OS X, the iPhone, and the Device That Shall Not Be Named

28 Oct 2011 8:00pm GMT

14 Oct 2011

feedPlanet Ruby

Charles Oliver Nutter: Why Clojure Doesn't Need Invokedynamic (Unless You Want It to be More Awesome)

This was originally posted as a comment on @fogus's blog post "Why Clojure doesn't need invokedynamic, but it might be nice". I figured it's worth a top-level post here.

Ok, there's some good points here and a few misguided/misinformed positions. I'll try to cover everything.

First, I need to point out a key detail of invokedynamic that may have escaped notice: any case where you must bounce through a generic piece of code to do dispatch -- regardless of how fast that bounce may be -- prevents a whole slew of optimizations from happening. This might affect Java dispatch, if there's any argument-twiddling logic shared between call sites. It would definitely affect multimethods, which are using a hand-implemented PIC. Any case where there's intervening code between the call site and the target would benefit from invokedynamic, since invokedynamic could be used to plumb that logic and let it inline straight through. This is, indeed, the primary benefit of using invokedynamic: arbitrarily complex dispatch logic folds away allowing the dispatch to optimize as if it were direct.

Your point about inference in Java dispatch is a fair one...if Clojure is able to infer all cases, then there's no need to use invokedynamic at all. But unless Clojure is able to infer all cases, then you've got this little performance time bomb just waiting to happen. Tweak some code path and obscure the inference, and kablam, you're back on a slow reflective impl. Invokedynamic would provide a measure of consistency; the only unforeseen perf impact would be when the dispatch turns out to *actually* be polymorphic, in which case even a direct call wouldn't do much better.

For multimethods, the benefit should be clear: the MM selection logic would be mostly implemented using method handles and "leaf" logic, allowing hotspot to inline it everywhere it is used. That means for small-morphic MM call sites, all targets could potentially inline too. That's impossible without invokedynamic unless you generate every MM path immediately around the eventual call.

Now, on to defs and Var lookup. Depending on the cost of Var lookup, using a SwitchPoint-based invalidation plus invokedynamic could be a big win. In Java 7u2, SwitchPoint-based invalidation is essentially free until invalidated, and as you point out that's a rare case. There would essentially be *no* cost in indirecting through a var until that var changes...and then it would settle back into no cost until it changes again. Frequently-changing vars could gracefully degrade to a PIC.

It's also dangerous to understate the impact code size has on JVM optimization. The usual recommendation on the JVM is to move code into many small methods, possibly using call-through logic as in multimethods to reuse the same logic in many places. As I've mentioned, that defeats many optimizations, so the next approach is often to hand-inline logic everywhere it's used, to let the JVM have a more optimizable view of the system. But now we're stepping on our own feet...by adding more bytecode, we're almost certainly impacting the JVM's optimization and inlining budgets.

OpenJDK (and probably the other VMs too) has various limits on how far it will go to optimize code. A large number of these limits are based on the bytecoded size of the target methods. Methods that get too big won't inline, and sometimes won't compile. Methods that inline a lot of code might not get inlined into other methods. Methods that inline one path and eat up too much budget might push out more important calls later on. The only way around this is to reduce bytecode size, which is where invokedynamic comes in.

As of OpenJDK 7u2, MethodHandle logic is not included when calculating inlining budgets. In other words, if you push all the Java dispatch logic or multimethod dispatch logic or var lookup into mostly MethodHandles, you're getting that logic *for free*. That has had a tremendous impact on JRuby performance; I had previous versions of our compiler that did indeed infer static target methods from the interpreter, but they were often *slower* than call site caching solely because the code was considerably larger. With invokedynamic, a call is a call is a call, and the intervening plumbing is not counted against you.

Now, what about negative impacts to Clojure itself...

#0 is a red herring. JRuby supports Java 5, 6, and 7 with only a few hundred lines of changes in the compiler. Basically, the compiler has abstract interfaces for doing things like constant lookup, literal loading, and dispatch that we simply reimplement to use invokedynamic (extending the old non-indy logic for non-indified paths). In order to compile our uses of invokedynamic, we use Rémi Forax's JSR-292 backport, which includes a "mock" jar with all the invokedynamic APIs stubbed out. In our release, we just leave that library out, reflectively load the invokedynamic-based compiler impls, and we're off to the races.

#1 would be fair if the Oracle Java 7u2 early-access drops did not already include the optimizations that gave JRuby those awesome numbers. The biggest of those optimizations was making SwitchPoint free, but also important are the inlining discounting and MutableCallSite improvements. The perf you see for JRuby there can apply to any indirected behavior in Clojure, with the same perf benefits as of 7u2.

For #2, to address the apparent vagueness in my blog post...the big perf gain was largely from using SwitchPoint to invalidate constants rather than pinging a global serial number. Again, indirection folds away if you can shove it into MethodHandles. And it's pretty easy to do it.

#3 is just plain FUD. Oracle has committed to making invokedynamic work well for Java too. The current thinking is that "lambda", the support for closures in Java 7, will use invokedynamic under the covers to implement "function-like" constructs. Oracle has also committed to Nashorn, a fully invokedynamic-based JavaScript implementation, which has many of the same challenges as languages like Ruby or Python. I talked with Adam Messinger at Oracle, who explained to me that Oracle chose JavaScript in part because it's so far away from Java...as I put it (and he agreed) it's going to "keep Oracle honest" about optimizing for non-Java languages. Invokedynamic is driving the future of the JVM, and Oracle knows it all too well.

As for #4...well, all good things take a little effort :) I think the effort required is far lower than you suspect, though.

14 Oct 2011 2:40pm GMT

07 Oct 2011

feedPlanet Ruby

Ruby on Rails: Rails 3.1.1 has been released!

Hi everyone,

Rails 3.1.1 has been released. This release requires at least sass-rails 3.1.4

CHANGES

ActionMailer

ActionPack

ActiveModel

ActiveRecord

ActiveResource

ActiveSupport

Railties

SHA-1

You can find an exhaustive list of changes on github. Along with the closed issues marked for v3.1.1.

Thanks to everyone!

07 Oct 2011 5:26pm GMT

26 Jul 2008

feedFOSDEM - Free and Open Source Software Developers' European Meeting

Update your RSS link

If you see this message in your RSS reader, please correct your RSS link to the following URL: http://fosdem.org/rss.xml.

26 Jul 2008 5:55am GMT

25 Jul 2008

feedFOSDEM - Free and Open Source Software Developers' European Meeting

Archive of FOSDEM 2008

These pages have been archived.
For information about the latest FOSDEM edition please check this url: http://fosdem.org

25 Jul 2008 4:43pm GMT

09 Mar 2008

feedFOSDEM - Free and Open Source Software Developers' European Meeting

Slides and videos online

Two weeks after FOSDEM and we are proud to publish most of the slides and videos from this year's edition.

All of the material from the Lightning Talks has been put online. We are still missing some slides and videos from the Main Tracks but we are working hard on getting those completed too.

We would like to thank our mirrors: HEAnet (IE) and Unixheads (US) for hosting our videos, and NamurLUG for quick recording and encoding.

The videos from the Janson room were live-streamed during the event and are also online on the Linux Magazin site.

We are having some synchronisation issues with Belnet (BE) at the moment. We're working to sort these out.

09 Mar 2008 3:12pm GMT