26 Jul 2010
Planet Ruby
Ruby on Rails: Rails 3.0: Release candidate!
High off Baltimore Pandemic and Yellow Tops, I believe we promised a release candidate shortly after RailsConf. As things usually go in open source, we gorged ourselves on fixes and improvements instead. But all to your benefit. We've had 842 commits by 125 authors since the release of the last beta!
Now it's time to just say good is good enough, otherwise we could keep on with this forever. So please welcome the Rails 3 release candidate! You install, as always, with gem install rails --pre.
Most of the fixes have been of minor significance, but we did manage to dramatically speed up Rails 3 development and startup speed for larger applications (Basecamp went from insufferable to about 2.3 levels of enjoyment).
Speed is now pretty good across the board except for part of Arel that Active Record now depends on. We'll be making sure we get performance of Active Record back to at least 2.3 levels before release.
A few more highlights:
- Support for the MySQL2 gem, which will take care of MySQL encoding issues on Ruby 1.9.2.
- Shallow routes are back.
- Fixed the autoload issues
- Made the rails command work even when you're in a subdirectory
- Dealt with a variety of web encoding issues
Indulge yourself in the delights of all the glorious details from the commit logs or check out the slightly less pedantic summaries in the CHANGELOGs.
This release candidate of Rails 3 also coincides with the release candidate of Bundler 1.0. Huge strides were made with Bundler and it should both be much faster and have most of the edge cases sawed off.
I've said "we're almost there" so many times that I'm almost exhausted. But really, guys, WE'RE ALMOST THERE!!!1
1 Just a few weeks before final is out?
26 Jul 2010 9:46pm GMT
24 Jul 2010
Phil Hagelberg: in which we watch while a veritable tower of babel is constructed
This week I was lucky enough to attend the Emerging Languages conference, a special event nestled snugly in a corner of O'Reilly's OSCON open source conference. Emerging Languages brought language designers and implementors together to share the things that made their languages unique and to cross-pollinate ideas.

The presentations covered a wide variety of languages. The talk on Go was interesting in that it wasn't so much about Go as about the historical heritage of Go and the languages that led to it. A lot of the interesting ideas there came out of Tony Hoare's Communicating Sequential Processes paper, which looks fascinating. The talk on Frink was a delightful romp through the esoteric and very human world of unit calculation, though the licensing issues surrounding that language rule it out for most uses. It's been a while since I've done web work, but the talk on CoffeeScript made me hope that I never have to write another line of Javascript. I also got to see Charlie present on Mirah, formerly Duby, which I've posted about before: a language that gives you low-level bytecode-equivalent output to Java but reduces the pain/verbosity by offering a more reasonable syntax and type inference.
The second day started off strong with an engaging demo of the visual Kodu language. It's unique in that it's designed to run on XBox consoles and can be programmed entirely through the controller by manipulating icons. Once again there are licensing issues and it won't run on anything but Windows or an XBox, but in this case the developers are actively working towards remedying the problem. They have clearly put a lot of thought and research into keeping it engaging especially for kids.
After this Rich Hickey presented on a Clojure feature tentatively called Pods, which are the new name for an experiment he had discussed much earlier called cells. The gist is that while transients can be a boon to performance, they introduce mutability (albeit very constrained mutability) outside the reference model. Pods separate out a very clear reference policy where you're always dealing with persistent values coming in and out, and the change happens isolated inside the pod. (I posted a link to the above photo in the Clojure IRC channel, which caused it to erupt in cries of "what are pods?" and "where's the documentation for this?" - a reminder that there are many Clojurians who feel the need to constantly stay abreast of every latest change no matter how recent.)
Another highlight of the second day was the talk on Factor, a modern cousin to Forth with an excellent compiler and nice tooling. Factor's been on my radar for a while since stack-based languages really sound like an interesting twist to language design, and everything I read about their compiler seems to indicate it's very cutting-edge and well-designed. Factor was also the only language presented I want to learn that isn't hosted; that is, it compiles straight to the metal. The demo focused on showing some of the ways that Factor retains an astonishing amount of flexibility and dynamicity even though it compiles to very fast machine code. The Emacs integration via FUEL also impressed me.
There were many, many more languages presented; they came at such a rate that if you blinked you'd look up to see the presentation half-through already. On the whole this was helpful since it forced presenters to focus on a "hook" or two to get you interested enough to dig deeper rather than give an overview of features which could easily be read from a web site, but such a wild ride left everyone with a minor case of mental whiplash.
It's been a while since I've attended an event that showcased this level of energy. I look forward to attending Emerging Languages 2011.
24 Jul 2010 10:45pm GMT
Rails on PostgreSQL: Pivotal Labs Talk - Scaling a Rails App with Postgres
I'm slowly catching up with my podcast backlog and came across a Pivotal Labs talk from May 2009. In this talk Josh Susser and Damon McCormick are presenting on Scaling a Rails App with Postgres. It's a little dated now - this talk was given when PostgreSQL 8.4 was in beta - but, still, lots of good stuff. Here are some notes:
- They started with an existing Rails app with lots of data, so they had some constraints - not greenfield development.
- Around the 5-6 minute mark there's a good discussion of PostgreSQL's query optimizer and how it analyzes a table's data distribution. One takeaway (mentioned around 16:20) is to run vacuum more often on a particular table if there are a lot of writes.
- 10:00 How to set STATISTICS for a particular table.
- 11:00 Using partial indexes.
- 14:00 Indexing on expressions.
- 18:10-23:00 A nice discussion of the EXPLAIN output.
- 23:45 Here they talk about wide columns. I've seen this in MySQL as well, where splitting text data out into a separate table yielded some good speedups.
- 26:10 Some discussion of pg_bench.
- 35:30 How long does it take to add an index to large tables? They saw times of up to an hour for tables with millions of rows.
- 36:30 clustering your data in order to get PostgreSQL to write it more efficiently.
- 37:30-48:00 A thorough discussion of partitioning tables via table inheritance. They used an ActiveRecord model (39:23) with a bunch of utility methods. They also had a cron job to periodically create new partitions. At 45:15 they make a nice distinction between using partial indexes and partitions - one advantage is that a partition's indexes can be different from its parent's indexes. At 49:00 they mention maybe doing a plugin, not sure if that happened.
- 52:00 Some discussion of full text search via tsearch.
- 53:00 PostgreSQL's lack of built in replication outside of WAL shipping, Slony, etc. Thank goodness 9.0 will address this!
- 54:00 Some props to Engine Yard on their PostgreSQL support.
Good stuff all around, and thanks to Pivotal for posting these great talks!
24 Jul 2010 1:01am GMT
20 Jul 2010
Tomasz Wegrzanowski: We need syntax for talking about Ruby types
All this is about discussing types in blog posts, documentation etc. None of that goes anywhere near actual code (except possibly in comments). Ruby never sees that.
Statically typed languages have all this covered, and we need it too. Not static typing of course - just an expressive way to talk about what types things are - as plain English fails here very quickly. As far as I know nothing like that exists yet, so here's my proposal.
This system of type descriptions is meant for humans, not machines. It focuses on the most important distinctions, and ignores details that are not important, or very difficult to keep track of. Type descriptions should only be as specific as necessary in given context. If it makes sense, these rules should be violated.
In advance I'll say I totally ignored all the covariance / contravariance / invariance business - it's far too complicated, and getting too deeply into such issues makes little sense in a language where everything can be redefined.
Basic types
Types of simple values can be described by their class name, or any of its superclasses or mixins. So some ways to describe type of 15 would be Fixnum (actual class), Integer (superclass), Comparable (mixin), or Object (superclass all the way up).
In context of describing types, everything is considered an Object, and existence of Kernel, BasicObject etc. is ignored.
So far, it should all be rather obvious. Examples:
- 42 - Integer
- Time.now - Time
- Dir.glob("*") - Enumerable
- STDIN - IO
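These descriptions can be checked against a running interpreter with is_a?, which accepts the actual class, any superclass, or any mixin. A minimal sketch; the describes? helper is a made-up name for illustration and not part of Ruby:

```ruby
# Hypothetical helper: true when `type` is a valid description of `value`,
# that is, its class, one of its superclasses, or one of its mixins.
def describes?(value, type)
  value.is_a?(type)
end

describes?(42, Integer)                 # class or superclass, depending on Ruby version
describes?(42, Comparable)              # mixin
describes?(42, Object)                  # superclass all the way up
describes?(Time.now, Time)
describes?(Dir.glob("*"), Enumerable)   # Dir.glob returns an Array, which mixes in Enumerable
describes?(STDIN, IO)
describes?(42, String)                  # not a valid description of 42
```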
nil and other ignored issues
nil will be treated specially - as if it was of every possible type. nil means absence of value, and doesn't indicate what type the value would have if it was present. This is messy, but most explicitly typed languages follow this path.
Distinction between situations that allow nils and those that don't will be treated as all other value range restrictions (Integer must be positive, IO must be open for writing etc.) - as something outside the type system.
For cases where nil means something magical, and not just absence of value, it should probably be mentioned.
Checked exceptions and related non-local exits in Ruby would be a hopeless thing to even attempt. There's syntax for exceptions and catches used as control structures if they're really necessary.
Booleans
We will also pretend that Boolean is a common superclass of TrueClass and FalseClass.
We will also normally ignore distinction between situations where real true/false are expected, and situations where any object goes, but acts identically to its boolean conversion. Any method that acts identically on x and !!x can be said to take Boolean.
On the other hand if some values are treated differently than their double negation, that's not really Boolean and it deserves a mention. Especially if nil and false are not equivalent - like in Rails's #in_groups_of (I don't think Ruby stdlib ever does anything like that).
Duck typing
If something quacks like a Duck convincingly enough, it can be said to be of type Duck, it being object's responsibility that its cover doesn't get blown.
In particular, Ruby uses certain methods for automatic type conversion. In many contexts objects implementing #to_str like Pathnames will be treated as Strings, objects implementing #to_ary as Arrays, #to_hash as Hashes, and #to_proc as Procs - this can be used for some amazing things like Symbol#to_proc.
This leads to a big complication for us - C code implementing Ruby interpreter and many libraries is normally written in a way that calls these conversion functions automatically, so in such contexts Symbol really is a Proc, Pathname really is a String and so on. On the other hand, in Ruby code these methods are not magical, and such conversions will only happen if explicitly called - for them Pathname and String are completely unrelated types. Unless Ruby code calls C code, which then autoconverts.
Explicitly differentiating between contexts which expect a genuine String and those which expect either that or something with a valid #to_str method would be highly tedious, and I doubt anyone would get it exactly right.
My recommendation would be to treat everything that autoconverts to something as if it subclassed it. So we'll pretend Pathname is a subclass of String, even though it's not really. In some cases this will be wrong, but it's not really all that different from subclassing something and then introducing incompatible changes.
This all doesn't extend to #to_s, #to_a etc - nothing can be described as String just because it has to_s method - every object has to_s but most aren't really strings.
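To make the difference between the two worlds concrete, here is a small sketch. Label is an invented class used only for illustration; the point is that C-implemented String methods accept it through #to_str, while plain Ruby code still sees an unrelated type.

```ruby
# An invented class that is "essentially a string" and says so via #to_str.
class Label
  def initialize(text)
    @text = text
  end

  # Implicit conversion hook: C-implemented methods call this automatically.
  def to_str
    @text
  end
end

label = Label.new("world")

greeting = "hello " + label         # String#+ converts its argument via #to_str
joined   = File.join("tmp", label)  # File.join does the same for path parts

# Ruby-level checks get no such magic: Label is not a String subclass.
really_a_string = label.is_a?(String)
```

Under the recommendation above, Label would simply be described as String, even though really_a_string is false.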
Technical explanation of to_str and friends
This section is unrelated to post's primary subject - skip if uninterested.
Ruby uses special memory layout for basic types like strings and arrays. Performance would be abysmal if string methods had to actually call Ruby code associated with whatever [] happened to be redefined to for every character - instead they ask for a certain C data structure, and access that directly (via some macros providing extra safety and convenience to be really exact).
By the way this is a great example of C being really slow - if Ruby was implemented on a platform with really good JIT, it could plausibly have every single string function implemented in terms of calls to [], []=, size, and just a few others, with different subclasses of String providing different implementations, and the JIT compiler inlining all that to make it really fast.
It would make it really simple to create class representing a text file, and =~ /regexp/ that directly without reading anything more than required to memory, or maybe even gsub! it in a way that would read it in small chunks, saving them to another file as soon as they're ready, and then renaming in one go. All that without regexp library knowing anything about it all. It's all just my fantasy, I'm not saying any such JIT actually exists.
Anyway, strings and such are implemented specially, but we still want these types to be real objects, not like what they've done in Java. To make it work, all C functions requiring access to underlying storage call a special macro which automatically calls a method like to_str or to_ary if necessary - so such objects can pretend to be strings very effectively. For example, if you alias method to_str to path on File, code like system File.open("/bin/hostname") will suddenly start working. It really makes sense only for things which are "essentially strings" like Pathname, URI, Unicode-enhanced strings, proxies for strings in third party libraries like Qt etc.
To complicate things further objects of all classes inheriting from String automatically use String's data representation - and C code will access that, never calling to_str. This leaves objects which duck type as Strings two choices:
- Subclass String and every time anything changes update C string data. This can be difficult - if you implement a URI and keep query part as a hash instance variable - you need to somehow make sure that your update code gets run every time that hash changes - like by not exposing it at all and only allowing query updates via your direct methods, or wrapping it in a special object that calls you back.
- Don't subclass String, define to_str the way you want. Everything works - except your class isn't technically a String so it's not terribly pretty OO design.
You probably won't be surprised that not subclassing is the more popular choice. As it's all due to technical limitations not design choices, it makes sense to treat such objects as if they were properly subclassed.
Collections
Back to the subject. For collections we often want to describe types of their elements. For simple collections yielding successive elements on #each, syntax for type description is CollectionType[MemberType]. Examples:
- [42.0, 17.5] - Array[Float]
- Set["foo","bar"] - Set[String]
- 5..10 - Range[Integer]
When we don't care about collection type, only about element types, descriptions like Enumerable[ElementType] will do.
Syntax for types of hashtables is Hash[KeyType, ValueType] - in general collections which yield multiple values to #each can be described as CollectionType[Type1, Type2, ..., TypeN].
For example {:foo => "bar"} is of type Hash[Symbol, String].
This is optional - type descriptions like Hash or Enumerable are perfectly valid - and often types are unrelated, or we don't care.
Not every Enumerable should be treated as collection of members like that - File might technically be File[String] but it's usually pointless to describe it this way. In 1.8 String is Enumerable, yielding successive lines when iterated - but String[String] makes no sense (no longer a problem in 1.9).
Classes other than Enumerable like Delegator might need type parameters, and they should be specified with the same syntax. Their order and meaning depends on particular class, but usually should be obvious.
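Although the notation is meant for humans, a rough runtime check can help show what a description like Array[Float] or Hash[Symbol, String] claims. This is only a sketch: fits? is a hypothetical helper, and it accepts nil members because nil is treated here as being of every type.

```ruby
require "set"

# Hypothetical helper: does `collection` fit CollectionType[MemberType, ...]?
# With one member type each yielded element is checked directly; with several,
# each yielded element is expected to be a tuple such as a Hash's [key, value].
def fits?(collection, collection_type, *member_types)
  return false unless collection.is_a?(collection_type)

  collection.all? do |entry|
    values = member_types.size > 1 ? entry : [entry]
    values.zip(member_types).all? { |value, type| value.nil? || value.is_a?(type) }
  end
end

fits?([42.0, 17.5], Array, Float)             # Array[Float]
fits?(Set["foo", "bar"], Set, String)         # Set[String]
fits?(5..10, Range, Integer)                  # Range[Integer]
fits?([1, 2, 3], Enumerable, Integer)         # Enumerable[Integer]
fits?({:foo => "bar"}, Hash, Symbol, String)  # Hash[Symbol, String]
fits?([42.0, "oops"], Array, Float)           # does not fit Array[Float]
```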
Literals and tuples
Ruby doesn't make distinction between Arrays and tuples. What I mean here is a kind of Array which shouldn't really be treated as a collection, and in which different members have unrelated type and meaning depending on their position.
Like method arguments. It really wouldn't be useful to say that every method takes Array[Object] (and an optional Proc) - types and meanings of elements in this array should be specified.
Syntax I want for this is [Type1, Type2, *TypeRest] - so for example Hash[Date, Integer]'s #select passes [Date, Integer] to the block, which should return a Boolean result, and then returns either Array[[Date, Integer]] (1.8) or Hash[Date, Integer] (1.9). Notice double [[]]s here - it's an Array of pairs. In many contexts Ruby automatically unpacks such tuples, so Array[[Date,Integer]] can often be treated as Array[Date,Integer] - but it doesn't go deeper than one level, and if you need this distinction it's available.
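A short sketch of the #select case, assuming Ruby 1.9 or later (where #select on a Hash returns a Hash). The visits data is invented for illustration.

```ruby
require "date"

# Hash[Date, Integer]
visits = {
  Date.new(2010, 7, 20) => 3,
  Date.new(2010, 7, 24) => 0
}

# The block receives a [Date, Integer] tuple, unpacked into two parameters,
# and returns a Boolean.
busy = visits.select { |date, count| count > 0 }

# Array[[Date, Integer]]: an Array of pairs, note the double brackets.
pairs = visits.to_a

# One level of automatic unpacking lets the pairs be treated like two arguments.
counts = pairs.map { |date, count| count }
```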
Extra arguments can be specified with *Type or ... which is treated here as *Object. If you want to specify some arguments as optional suffix their types with ? (the most obvious [] having too many uses already, and = not really fitting right).
In this syntax [*Foo] is pretty much equivalent to Array[Foo], or possibly Enumerable[Foo] (with some duck typing) - feel free to use that if it makes things clearer.
Basic literals like true, false, nil stand for themselves - and for entire TrueClass, FalseClass, NilClass classes too as they're their only members. Other literals such as symbols, strings, numbers etc. can be used too when needed.
To describe keyword arguments and hashes used in similar way, syntax is {Key1=>Type1, Key2=>Type2} - specifying exact key, and type of value like {:noop=>Boolean, :force=>Boolean}.
It should be assumed that keys other than those listed are ignored, cause exception, or are otherwise not supported. If they're meaningful it should be marked with ... like this {:query=>String, ...}. Subclasses often add extra keyword arguments, and this issue is ignored.
Functions
Everything so far was just a prelude to the most important part of any type system - types for functions. Syntax I'd propose is: ArgumentTypes -> ReturnType (=> being already used by hashes).
I cannot decide if blocks should be specified in Ruby-style notation or a function notation, so both & {|BlockArgumentTypes| BlockReturnType} and &(BlockArgumentTypes->BlockReturnType) are valid. & is necessary, as blocks are passed separately from normal arguments, however strong the temptation to reuse -> and let the context disambiguate might be.
Blocks that don't take any arguments or don't return anything can drop that part, leaving only something like &{|X|}, &{Y}, &{}, or in more functional notation &(X->), &(Y), &().
Because of all the [] unpacking, using [] around arguments, tuple return values etc. is optional - and just like in Ruby () can be used instead in such contexts.
If function doesn't take any arguments, or returns no values, these parts can be left out - leaving perhaps as little as ->.
Examples:
- In context of %w[Hello world !].group_by(&:size) method #group_by has type Array[String]&{|String| Integer}->Hash[Integer,Array[String]]
- Time.at has type Numeric -> Time
- String#tr has type [String, String] -> String
- On a collection of Floats, #find would have type Float?&(Float->Boolean)->Float
- Function which takes no arguments and returns no values has type []->nil
If you really need to specify exceptions and throws, you can add raises Type, or throws :kind after return value. Use only for control structure exceptions, not for actual error exceptions. It might actually be useful if actual data gets passed around.
- Find.find has type [*String]&(String->nil throws :prune)->nil
A standalone Proc can be described as (ArgumentsTypes->ReturnType) just as with notation for functions. There is no ambiguity between Proc arguments and block arguments, as blocks are always marked with &.
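For reference, here are a few of these functions called from Ruby, with a description in the notation next to each call. The descriptions are comments only; as said at the start, Ruby never sees them. The variable names are mine.

```ruby
# Time.at: Numeric -> Time
epoch = Time.at(0)

# String#tr: [String, String] -> String
swapped = "hello".tr("el", "ip")

# #group_by on an Array[String], with a block String -> Integer
grouped = %w[Hello world !].group_by(&:size)

# #find on a collection of Floats, with a block Float -> Boolean
found = [1.5, 2.5, 3.5].find { |x| x > 2 }

# A standalone Proc: (Integer -> Integer)
double = proc { |n| n * 2 }
doubled = double.call(21)
```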
Type variable and everything else
In addition to names of real classes, any name starting with an uppercase letter should be considered a type. Unless it's specified otherwise in context, all such unknown names should be considered type variables with big forall quantifier in front of it all.
Examples:
- Enumerable[A]#partition has type &(A->Boolean)->[Array[A], Array[A]]
- Hash[A,B]#merge has type Hash[A,B]&(A,B,B->B)->Hash[A,B]
- Array[A]#inject has either type B&(B,A->B)->B or &(A,A->A)->A. This isn't just a usual case of missing argument being substituted by nil - these are two completely different functions.
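The three examples can be tried directly. A small sketch with invented data; note how the seeded form of #inject can build a result of a different type than the elements, while the seedless form cannot.

```ruby
numbers = [1, 2, 3]   # Array[Integer], so A is Integer

# #partition: the block decides, the result is a pair of Arrays.
evens, odds = numbers.partition { |n| n.even? }

# Hash#merge with a block: called with key, old value, new value on conflicts.
merged = { :a => 1, :b => 2 }.merge({ :b => 10, :c => 20 }) { |key, old, new| old + new }

# #inject with a seed: B is String here, the accumulator type differs from A.
digits = numbers.inject("") { |text, n| text + n.to_s }

# #inject without a seed: accumulator and elements share one type.
sum = numbers.inject { |a, b| a + b }
```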
To specify that multiple types are allowed (usually implying that behaviour will be different, otherwise there should be a superclass somewhere, or we could treat it as common duck typing and ignore it) join them with |. If there's ambiguity between this use and block arguments, parenthesize. It binds more tightly than ,, so it only applies to one argument. Example:
- String#index in 1.8 has type (String|Integer|Regexp, Integer?)->Integer (and notice how I ignored Fixnums here).
For functions that can be called in multiple unrelated ways, just list them separately - | and parentheses will work, but they are usually top level, and not needed anywhere deeper.
If you want to specify type of self, prefix function specification with Type#:
- #sort has type like Enumerable[A]#()&(A,A->1|0|-1)->Array[A]
To specify that something takes range of values not really corresponding to a Ruby class, just define such extra names somewhere and then use like this:
- File#chown has type (UnixUserId, UnixUserId)->0 - with UnixUserId being a pretend subclass of Integer, and 0 is literal value actually returned.
To specify that something needs a particular method just make up a pretend mixin like Meowable for #meow.
Any obvious extensions to this notation can be used, like this:
- Enumerable[A]#zip has type (Enumerable[B_1], *Enumerable[B_i])->Array[A, B_1, *B_i] - with intention that B_is will be different for each argument understood from context. (I don't think any static type system handles cases like this one reasonably - most require separate case for each supported tuple length, and you cannot use arrays if you mix types. Am I missing something?)
The End
Well, what I really wanted to do was talk about Ruby collection system, and how 1.9 doesn't go far enough in its attempts at fixing it. And without notation for types talking about higher-order functions that operate on collections quickly turns into a horrible mess. So I started with a brief explanation of notation I wanted to use, and then I figured out I might as well do it right and write something that will be reusable in other contexts too.
Most discussion of type systems concerns issues like safety and flexibility, which don't concern me at all, and limit themselves to type systems usable by machines.
I need types for something else - as statements about data flow. Type signature like Enumerable[A]#()&(A->B)->Hash[A,B] doesn't tell you exactly what such function does but narrows set of possibilities extremely quickly. What it describes is a function which iterates over collection in order while building a Hash to be returned, using collection's elements as keys, and values returned by the block as values. Can you guess the function I was thinking about here?
Now a type like that is not a complete specification - a function that returns an empty hash fits it. As does one which skips every 5th element. And one that only keeps entries with unique block results. And for that matter also one that sends your email password to NSA - at least assuming it returns that Hash afterwards.
It was still pretty useful. How about some of those?
- Hash[A,B] -> Hash[B, Array[A]]
- Hash[A,B] &(A,B->C) -> Hash[A,C]
- Hash[A, Hash[B,C]] -> Hash[[A,B], C]
- Hash[A,B] &(A,B->C) -> Hash[C, Hash[A,B]]
- Enumerable[Hash[A,B]] &(A,B,B->B) -> Hash[A,B]
- Hash[A,Set[B]] -> Hash[Set[A], Set[B]]
Even these short snippets should give a pretty good idea what these are all about.
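As a sketch of how much a signature narrows things down, here is one plausible reading of the first three. The method names are invented, and as noted above other functions fit the same types equally well.

```ruby
# Hash[A,B] -> Hash[B, Array[A]]
# One reading: collect the keys that share each value.
def keys_by_value(hash)
  result = {}
  hash.each { |key, value| (result[value] ||= []) << key }
  result
end

# Hash[A,B] &(A,B->C) -> Hash[A,C]
# One reading: keep the keys, replace each value with the block's result.
def map_values(hash)
  result = {}
  hash.each { |key, value| result[key] = yield(key, value) }
  result
end

# Hash[A, Hash[B,C]] -> Hash[[A,B], C]
# One reading: flatten a nested hash, using [outer, inner] pairs as keys.
def flatten_keys(nested)
  result = {}
  nested.each do |outer, inner|
    inner.each { |inner_key, value| result[[outer, inner_key]] = value }
  end
  result
end
```

For example, keys_by_value({:a => 1, :b => 1, :c => 2}) gives {1 => [:a, :b], 2 => [:c]}.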
That's it for now. Hopefully it won't be long until that promised 1.9 collections post.
20 Jul 2010 8:27pm GMT
19 Jul 2010
Charles Oliver Nutter: What JRuby C Extension Support Means to You
As part of the Ruby Summer of Code, Tim Felgentreff has been building out C extension support for JRuby. He's already made great progress, with simple libraries like Thin and Mongrel working now and larger libraries like RMagick and Yajl starting to function. And we haven't even reached the mid-term evaluation yet. I'd say he gets an "A" so far.
I figured it was time I talked a bit about C extensions, what they mean (or don't mean) for JRuby, and how you can help.
The Promise of C Extensions
One of the "last mile" features keeping people from migrating to JRuby has been their dependence on C extensions that only work on regular Ruby. In some cases, these extensions have been written to improve performance, like the various json libraries. Some of that performance could be less of a concern under Ruby 1.9, but it's hard to claim that any implementation will be able to run Ruby as fast as C for general-purpose libraries any time soon.
However, a large number of extensions - perhaps a majority of extensions - exist only to wrap a well-known and well-trusted C library. Nokogiri, for example, wraps the excellent libxml. RMagick wraps ImageMagick. For these cases, there's no alternative on regular Ruby...it's the C library or nothing (or in the case of Nokogiri, your alternatives are only slow and buggy pure-Ruby XML libraries).
For the performance case, C extensions on JRuby don't mean a whole lot. In most cases, it would be easier and just as performant to write that code in Java, and many pure-Ruby libraries perform well enough to reduce the need for native code. In addition, there are often libraries that already do what the perf-driven extensions were written for, and it's trivial to just call those libraries directly from Ruby code.
But the library case is a bit stickier. Nokogiri does have an FFI version, but it's a maintenance headache for them and a bug report headache for us, due to the lack of a C compiler tying the two halves together. There's a pure-Java Nokogiri in progress, but building both the Ruby bindings and emulating libxml behavior takes a long time to get right. For libraries like RMagick or the native MySQL and SQLite drivers, there are basically no options on the JVM. The Google Summer of Code project RMagick4J, by Sergio Arbeo, was a monumental effort that still has a lot of work left to be done. JDBC libraries work for databases, but they provide a very different interface from the native drivers and don't support things like UNIX domain sockets.
There's a very good chance that JRuby C extension support won't perform as well as C extensions on C Ruby, but in many cases that won't matter. Where there's no equivalent library now, having something that's only 5-10x slower to call - but still runs fast and matches API - may be just fine. Think about the coarse-grained operations you feed to a MySQL or SQLite and you get the picture.
So ultimately, I think C extensions will be a good thing for JRuby, even if they only serve as a stopgap measure to help people migrate small applications over to native Java equivalents. Why should the end goal be native Java equivalents, you ask?
The Peril of C Extensions
Now that we're done with the happy, glowing discussion of how great C extension support will be, I can make a confession: I hate C extensions. No feature of C Ruby has done more to hold it back than the desire for backward compatibility with C extensions. Because they have direct pointer access, there's no easy way to build a better garbage collector or easily support multiple runtimes in the same VM, even though various research efforts have tried. I've talked with Koichi Sasada, the creator of Ruby 1.9's "YARV" VM, and there's many things he would have liked to do with YARV that he couldn't because of C extension backward compatibility.
For JRuby, supporting C extensions will limit many features that make JRuby compelling in the first place. For example, because C extensions often use a lot of global variables, you can't use them from multiple JRuby runtimes in the same process. Because they expect a Ruby-like threading model, we need to restrict concurrency when calling out from Java to C. And all the great memory tooling I've blogged about recently won't see C extensions or the libraries they call, so it introduces an unknown.
All that said, I think it's a good milestone to show that we can support C extensions, and it may make for a "better JNI" for people who really just want to write C or who simply need to wrap a native library.
How You Can Help
There's a few things I think users like you can help with.
First off, we'd love to know what extensions you are using today, so we can explore what it would take to run them under JRuby (and so we can start exploring pure-Java alternatives, too.) Post your list in the comments, and we'll see what we can come up with.
Second, anyone that knows C and the Ruby C API (like folks who work on extensions) could help us fill out bits and pieces that are missing. Set up the JRuby cext branch (I'll show you how in a moment), and try to get your extensions to build and load. Tim has already done the heavy lifting of making "gem install xyz" attempt to build the extension and "require 'xyz'" try to load the resulting native library, so you can follow the usual processes (including extconf.rb/mkmf.rb for non-gem building and testing.) If it doesn't build ok, help us figure out what's missing or incorrect. If it builds but doesn't run, help us figure out what it's doing incorrectly.
Building JRuby with C Extension Support
Like building JRuby proper, building the cext work is probably the easiest thing you'll do all day (assuming the C compiler/build/toolchain doesn't bite you).
- Check out (or fork and check out) the JRuby repository from http://github.com/jruby/jruby:
git clone git://github.com/jruby/jruby.git
- Switch to the "cext" branch:
git checkout -b cext origin/cext
- Do a clean build of JRuby plus the cext subsystem:
ant clean build-jruby-cext-native
At this point you should have a JRuby build (run with bin/jruby) that can gem install and load native extensions.
19 Jul 2010 8:08pm GMT
Charles Oliver Nutter: Browsing Memory with Ruby and Java Debug Interface
This is the third post in a series. The first two were on Browsing Memory the JRuby Way and Finding Leaks in Ruby Apps with Eclipse Memory Analyzer.
Hello again, friends! I'm back with more exciting memory analysis tips and tricks! Ready? Here we go!
After my previous two posts, several folks asked if it's possible to do all this stuff from Ruby, rather than using Java or C-based apps shipped with the JVM. The answer is yes! Because of the maturity of the Java platform, there are standard Java APIs you can use to access all the same information the previous tools consumed. And since we're talking about JRuby, that means you have Ruby APIs you can use to access that information.
That's what I'm going to show you today.
Introducing JDI
The APIs we'll be using are part of the Java Debug Interface (JDI), a set of Java APIs for remotely inspecting a running application. It's part of the Java Platform Debugger Architecture, which also includes a C/C++ API, a wire protocol, and a raw wire protocol API. Exploring those is left as an exercise for the reader...but they're also pretty cool.
We'll use the Rails app from before, inspecting it immediately after boot. JDI provides a number of ways to connect up to a running VM, using VirtualMachineManager; you can either have the debugger make the connection or the target VM make the connection, and optionally have the target VM launch the debugger or the debugger launch the target VM. For our example, we'll have the debugger attach to a target VM listening for connections.
Preparing the Target VM
The first step is to start up the application with the appropriate debugger endpoint installed. This new flag is a bit of a mouthful (and we should make a standard flag for JRuby users), but we're simply setting up a socket-based listener on port 12345, running as a server, and we don't want to suspend the JVM when the debugger connects.
jruby -J-agentlib:jdwp=transport=dt_socket,server=y,address=12345,suspend=n -J-Djruby.reify.classes=true script/server -e production
The -J-Djruby.reify.classes bit I talked about in my first post. It makes Ruby classes show up as Java classes for purposes of heap inspection.
The rest is just running the server in production mode.
As you can see, remote debugging is already baked into the JVM, which means we didn't have to write it or debug it. And that's pretty awesome.
Let's connect to our Rails process and see what we can do.
Connecting to the target VM
In order to connect to the target VM, you need to do the Java factory dance. We start with the com.sun.jdi.Bootstrap class, get a com.sun.jdi.VirtualMachineManager, and then connect to a target VM to get a com.sun.jdi.VirtualMachine object.
vmm = Bootstrap.virtual_machine_manager
sock_conn = vmm.attaching_connectors[0] # not guaranteed to be Socket
args = sock_conn.default_arguments
args['hostname'].value = "localhost"
args['port'].value = "12345"
vm = sock_conn.attach(args)
Notice that I didn't dig out the socket connector explicitly here, because on my system, the first connector always appears to be the socket connector. Here's the full list for me on OS X:
➔ jruby -rjava -e "puts com.sun.jdi.Bootstrap.virtual_machine_manager.attaching_connectors
> "
[com.sun.jdi.SocketAttach (defaults: timeout=, hostname=charles-nutters-macbook-pro.local, port=),
com.sun.jdi.ProcessAttach (defaults: pid=, timeout=)]
The ProcessAttach connector there isn't as magical as it looks; all it does is query the target process to find out what transport it's using (dt_socket in our case) and then calls the right connector (e.g. SocketAttach in the case of dt_socket or SharedMemoryAttach if you use dt_shmem on Windows). In our case, we know it's listening on a socket, so we're using the SocketAttach connector directly.
The rest is pretty simple: we get the default arguments from the connector, twiddle them to have the right hostname and port number, and attach to the VM. Now we have a VirtualMachine object we can query and twiddle; we're inside the matrix.
With Great Power...
So, what can we do with this VirtualMachine object? We can:
- walk all classes and objects on the heap
- install breakpoints and step-debug any running code
- inspect and modify the current state of any running thread, even manipulating in-flight arguments and variables
- replace already-loaded classes with new definitions (such as to install custom instrumentation)
Here's the output from JRuby's ri command when we ask about VirtualMachine:
➔ ri --java com.sun.jdi.VirtualMachine
-------------------------------------- Class: com.sun.jdi.VirtualMachine
(no description...)
------------------------------------------------------------------------
Instance methods:
-----------------
allClasses, allThreads, canAddMethod, canBeModified,
canForceEarlyReturn, canGetBytecodes, canGetClassFileVersion,
canGetConstantPool, canGetCurrentContendedMonitor,
canGetInstanceInfo, canGetMethodReturnValues,
canGetMonitorFrameInfo, canGetMonitorInfo, canGetOwnedMonitorInfo,
canGetSourceDebugExtension, canGetSyntheticAttribute, canPopFrames,
canRedefineClasses, canRequestMonitorEvents,
canRequestVMDeathEvent, canUnrestrictedlyRedefineClasses,
canUseInstanceFilters, canUseSourceNameFilters,
canWatchFieldAccess, canWatchFieldModification, classesByName,
description, dispose, eventQueue, eventRequestManager, exit,
getDefaultStratum, instanceCounts, mirrorOf, mirrorOfVoid, name,
process, redefineClasses, resume, setDebugTraceMode,
setDefaultStratum, suspend, toString, topLevelThreadGroups,
version, virtualMachine
We can basically make the target VM dance any way we want, even going so far as to write our own debugger entirely in Ruby code. But that's a topic for another day. Right now, we're going to do some memory inspection.
Creating a Histogram of the Heap
The simplest heap inspection we might do is to produce a histogram of all objects on the heap. And as you might expect, this is one of the easiest things to do, because it's the first thing everyone looks for when debugging a memory issue.
classes = vm.all_classes
counts = vm.instance_counts(classes)
classes.zip(counts)
VirtualMachine.all_classes gives you a list (a java.util.List, but we make those behave mostly like a Ruby Array) of every class the JVM has loaded, including Ruby classes, JRuby core and runtime classes, and other Java classes that JRuby and the JVM use. VirtualMachine.instance_counts takes that list of classes and returns another list of instance counts. Zip the two together, and we have an array of classes and instance counts. So easy!
Let's take these two pieces and put them together in an easy-to-use class:
require 'java'
module JRuby
  class Debugger
    VMM = com.sun.jdi.Bootstrap.virtual_machine_manager
    attr_accessor :vm
    def initialize(options = {})
      connectors = VMM.attaching_connectors
      if options[:port]
        connector = connectors.find {|ac| ac.name =~ /Socket/}
      elsif options[:pid]
        connector = connectors.find {|ac| ac.name =~ /Process/}
      end
      args = connector.default_arguments
      for k, v in options
        args[k.to_s].value = v.to_s
      end
      @vm = connector.attach(args)
    end
    # Generate a histogram of all classes in the system
    def histogram
      classes = @vm.all_classes
      counts = @vm.instance_counts(classes)
      classes.zip(counts)
    end
  end
end
I've taken the liberty of expanding the connection process to handle pids and other arguments passed in. So to get a histogram from a VM listening on localhost port 12345, we can simply do:
JRuby::Debugger.new(:hostname => 'localhost', :port => 12345).histogram
Now of course this list is going to have a lot of JRuby and Java objects that we might not be interested in, so we'll want to filter it to just the Ruby classes. On JRuby master, all the generated Ruby classes start with a package name "ruby". Unfortunately, jitted Ruby methods start with a package of "ruby.jit" right now, so we'll want to filter those out too (unless you're interested in them, of course...JRuby is an open book!)
require 'jruby_debugger'
# connect to the VM
debugr = JRuby::Debugger.new(:hostname => 'localhost', :port => 12345)
histo = debugr.histogram
# sort by count, largest first
histo.sort! {|a,b| b[1] <=> a[1]}
# filter to only user-created Ruby classes with >0 instances
histo.each do |cls,num|
  next if num == 0 || cls.name[0..4] != 'ruby.' || cls.name[5..7] == 'jit'
  puts "#{num} instances of #{cls.name[5..-1].gsub('.', '::')}"
end
If we run this short script against our Rails application, we see similar results to the previous posts (but it's cooler, because we're doing it all from Ruby!)
➔ jruby ruby_histogram.rb | head -10
11685 instances of TZInfo::TimezoneTransitionInfo
1071 instances of Gem::Version
1012 instances of Gem::Requirement
592 instances of TZInfo::TimezoneOffsetInfo
432 instances of Gem::Dependency
289 instances of Gem::Specification
142 instances of ActiveSupport::TimeZone
118 instances of TZInfo::DataTimezoneInfo
118 instances of TZInfo::DataTimezone
45 instances of Gem::Platform
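The filtering and renaming step in that script is plain string handling, so it can be tried without a JVM. Here is a minimal sketch, assuming the histogram has already been reduced to [class name, count] pairs. The helper name ruby_class_name is hypothetical, and only the two counts taken from the output above are real; the other sample rows are invented for illustration.

```ruby
# Convert a reified JVM class name such as "ruby.TZInfo.TimezoneOffsetInfo"
# into its Ruby spelling, or return nil for names we want to skip.
# (ruby_class_name is a hypothetical helper, not part of JRuby or JDI.)
def ruby_class_name(jvm_name)
  return nil unless jvm_name.start_with?('ruby.')  # not a reified Ruby class
  return nil if jvm_name.start_with?('ruby.jit')   # jitted method bodies
  jvm_name.sub(/\Aruby\./, '').gsub('.', '::')
end

sample = [
  ['java.lang.String', 50_000],                  # invented row
  ['ruby.jit.SomeMethod_123', 1],                # invented row
  ['ruby.Gem.Version', 1071],                    # from the output above
  ['ruby.TZInfo.TimezoneTransitionInfo', 11685], # from the output above
  ['ruby.Unused', 0]                             # invented row
]

report = sample.
  map    { |name, count| [ruby_class_name(name), count] }.
  reject { |name, count| name.nil? || count == 0 }.
  sort_by { |_, count| -count }

report.each { |name, count| puts "#{count} instances of #{name}" }
```

Keeping the name conversion in one small method makes it easy to change the rules later, for example to include the jitted classes.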
Just so we're all on the same page, it's important to know what we're actually dealing with here. VirtualMachine.all_classes returns a list of com.sun.jdi.ReferenceType objects. Let's ri that.
➔ ri --java com.sun.jdi.ReferenceType
--------------------------------------- Class: com.sun.jdi.ReferenceType
(no description...)
------------------------------------------------------------------------
Instance methods:
-----------------
allFields, allLineLocations, allMethods, availableStrata,
classLoader, classObject, compareTo, constantPool,
constantPoolCount, defaultStratum, equals, failedToInitialize,
fieldByName, fields, genericSignature, getValue, getValues,
hashCode, instances, isAbstract, isFinal, isInitialized,
isPackagePrivate, isPrepared, isPrivate, isProtected, isPublic,
isStatic, isVerified, locationsOfLine, majorVersion, methods,
methodsByName, minorVersion, modifiers, name, nestedTypes,
signature, sourceDebugExtension, sourceName, sourceNames,
sourcePaths, toString, virtualMachine, visibleFields,
visibleMethods
You can see there's quite a bit more you can do with a ReferenceType. Let's try something.
Digging Deeper Into TimezoneTransitionInfo
Let's actually take some time to explore our old friend TimezoneTransitionInfo (hereafter referred to as TTI). Instead of walking all classes in the system, we'll want to just grab TTI directly. For that we use VirtualMachine.classes_by_name, which returns a list of classes on the target VM of that name. There should be only one, since we only have a single JRuby instance in our server, so we'll grab that class and request exactly one instance of it...any old instance.
tti_class = debugr.vm.classes_by_name('ruby.TZInfo.TimezoneTransitionInfo')[0]
tti_obj = tti_class.instances(1)[0]
puts tti_obj
Running this we can see we've got the reference we're looking for.
➔ jruby tti_digger.rb
instance of ruby.TZInfo.TimezoneTransitionInfo(id=2)
ReferenceType.instances returns a list (no larger than the specified size, or all instances if you specify 0) of com.sun.jdi.ObjectReference objects.
➔ ri --java com.sun.jdi.ObjectReference
------------------------------------- Class: com.sun.jdi.ObjectReference
(no description...)
------------------------------------------------------------------------
Instance methods:
-----------------
disableCollection, enableCollection, entryCount, equals, getValue,
getValues, hashCode, invokeMethod, isCollected, owningThread,
referenceType, referringObjects, setValue, toString, type,
uniqueID, virtualMachine, waitingThreads
Among the weirder things like disabling garbage collection for this object or listing all threads waiting on this object's monitor (a la 'synchronize' in Java), we can access the object's fields through getValue and setValue.
Let's examine the instance variables TTI contains. You may recall from previous posts that all Ruby objects in JRuby store their instance variables in an array, to avoid the large memory and cpu cost of storing them in a map. We can grab a reference to that array and display its contents.
var_table_field = tti_class.field_by_name('varTable')
tti_vars = tti_obj.get_value(var_table_field)
puts "varTable: #{tti_vars}"
puts tti_vars.values.map(&:to_s)
And the new output:
➔ jruby tti_digger.rb
varTable: instance of java.lang.Object[7] (id=13)
instance of ruby.TZInfo.TimezoneOffsetInfo(id=15)
instance of ruby.TZInfo.TimezoneOffsetInfo(id=16)
instance of org.jruby.RubyFixnum(id=17)
instance of org.jruby.RubyFixnum(id=18)
instance of org.jruby.RubyNil(id=19)
instance of org.jruby.RubyNil(id=19)
instance of org.jruby.RubyNil(id=19)
Since the varTable field is a simple Object[] in Java, the reference we get to it is of type com.sun.jdi.ArrayReference.
➔ ri --java com.sun.jdi.ArrayReference
-------------------------------------- Class: com.sun.jdi.ArrayReference
(no description...)
------------------------------------------------------------------------
Instance methods:
-----------------
disableCollection, enableCollection, entryCount, equals, getValue,
getValues, hashCode, invokeMethod, isCollected, length,
owningThread, referenceType, referringObjects, setValue, setValues,
toString, type, uniqueID, virtualMachine, waitingThreads
Of course each of these references can be further explored, but already we can see that this TTI instance has seven instance variables: two TimezoneOffsetInfo objects, two Fixnums, and three nils. But we don't have instance variable names!
Instance variable names are only stored on the object's class. There, a table of names to offsets is kept up-to-date as new instance variable names are discovered. We can access this from the TTI class reference and combine it with the variable table to get the output we want to see.
# get the metaclass object and class reference
metaclass_field = tti_class.field_by_name('metaClass')
tti_class_obj = tti_obj.get_value(metaclass_field)
tti_class_class = tti_class_obj.reference_type
# get the variable names from the metaclass object
var_names_field = tti_class_class.field_by_name('variableNames')
var_names = tti_class_obj.get_value(var_names_field)
# splice the names and values together
table = var_names.values.zip(tti_vars.values)
puts table
This looks a bit complicated, but there's actually a lot of boilerplate here we could put into a utility class. For example, the metaClass and variableNames fields are standard on all (J)Ruby objects and classes, respectively. But considering that we're actually walking a remote VM's *live* heap...this is pretty simple code.
Here's what our script outputs now:
➔ jruby tti_digger.rb
"@offset"
instance of ruby.TZInfo.TimezoneOffsetInfo(id=25)
"@previous_offset"
instance of ruby.TZInfo.TimezoneOffsetInfo(id=26)
"@numerator_or_time"
instance of org.jruby.RubyFixnum(id=27)
"@denominator"
instance of org.jruby.RubyFixnum(id=28)
"@at"
instance of org.jruby.RubyNil(id=29)
"@local_end"
instance of org.jruby.RubyNil(id=29)
"@local_start"
instance of org.jruby.RubyNil(id=29)
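To make the "names on the class, values on the object" layout a little more concrete, here is a toy model in plain Ruby. It is only a sketch of the idea described above, not JRuby's actual implementation; ToyClass, ToyObject and their methods are hypothetical names.

```ruby
# Toy model only: these are NOT JRuby's classes.
class ToyClass
  attr_reader :variable_names

  def initialize
    @variable_names = []  # position in this array == slot in each var table
  end

  # Return the slot for a name, allocating one the first time it is seen.
  def slot_for(name)
    @variable_names.index(name) || begin
      @variable_names << name
      @variable_names.size - 1
    end
  end
end

class ToyObject
  attr_reader :var_table

  def initialize(meta_class)
    @meta_class = meta_class
    @var_table = []       # values only; no names stored per object
  end

  def set(name, value)
    @var_table[@meta_class.slot_for(name)] = value
  end

  # The same "zip names with values" step used in the script above.
  def dump
    @meta_class.variable_names.zip(@var_table)
  end
end

tti = ToyClass.new
a = ToyObject.new(tti)
a.set('@offset', :offset_info)
a.set('@denominator', 2)
b = ToyObject.new(tti)
b.set('@denominator', 7)  # reuses the existing slot; '@offset' stays nil

p a.dump
p b.dump
```

The point of the layout is that each object pays only for an array of values, while the name-to-slot table is shared by every instance of the class.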
We could go even deeper, but I think you get the idea.
Your Turn
Here's a gist of the three scripts we've created, so you can refer to and build off of them. And of course the javadocs and ri docs will help you as well, plus everything we've done here you can do in a jirb session.
There's a lot to the JDI API, but once you've got the VirtualMachine object in hand it's pretty easy to follow. As you'd expect from any debugger API, you need to know a bit about how things work on the inside, but through the magic of JRuby it's actually possible to write most of those fancy memory and debugging tools entirely in Ruby. Perhaps this article has piqued your interest in exploring JRuby internals using JDI and you might start to write debugging tools. Perhaps we can ship a few utilities to make some of the boilerplate go away. In any case, I hope this series of articles shows that JRuby users have an amazing library of tools available to them, and you don't even have to leave your comfort zone if you don't want to.
Note: The variableNames field is a recent addition to JRuby master, so if you'd like to play with that you'll probably want to build JRuby yourself or wait for a nightly build that picks it up. But you can certainly do a lot of exploring even without that patch.
19 Jul 2010 4:01am GMT
18 Jul 2010
Planet Ruby
Tomasz Wegrzanowski: If only Ruby had macros
Blogger will most likely totally destroy code formatting again, sorry about that.
Ruby annoys me a lot - the code gets so close to being Just Right, with only that last little bit of wrongness that won't go away no matter what. With everything except Ruby at least I know it will be crap no matter what, so I never get this.
For example it's so easy to make a function generating successive values on each call:
def counter(v)
return counter(v, &:succ) unless block_given?
proc{ v = yield(v) }
end
But you must give it the value before the first - and sometimes such a thing doesn't exist, like with generating successive labels "a", "b", "c" ... A counter starting from the first value passed isn't exactly difficult, it just doesn't feel right:
def counter(v)
return counter(v, &:succ) unless block_given?
proc{ old, v = v, yield(v); old }
end
Useless variables like old that only indicate control flow just annoy me. Not to mention lack of default block argument. I'm undecided if this tap makes things better or worse.
def counter(v)
return counter(v, &:succ) unless block_given?
proc{v.tap{ v = yield(v) }}
end
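For reference, this is how such a counter behaves when called. The sketch below is a variant of the second version that takes the step as an explicit block argument instead of using yield; that change, and the variable names labels and doubles, are choices of this example only.

```ruby
# Counter that starts from the first value passed in.
# The step defaults to :succ when no block is given.
def counter(v, &step)
  step ||= :succ.to_proc
  proc { old, v = v, step.call(v); old }
end

labels = counter("y")
first_labels = Array.new(4) { labels.call }   # String#succ rolls "z" over to "aa"

doubles = counter(1) { |n| n * 2 }
first_doubles = Array.new(4) { doubles.call }

p first_labels
p first_doubles
```

Each proc closes over its own v, so the two counters advance independently.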
Another example. This wrapper for Ruby executable makes rubygems and -r compatible. It's so close to being able to use Array#map, and yet so far away:
args = []
while arg = ARGV.shift
if arg =~ /\A-r(.*)\z/
lib = $1.empty? ? ARGV.shift : $1
args << "-e" << "require 'rubygems'; require '#{lib}'"
else
args << arg
end
end
exec "ruby", *args
Yes, these are tiny things, but it's frustrating to get almost there. By the way, -r should just call require, another thing which is almost right but no.
I could go on with these small examples, but I want to talk about something bigger. A very common pattern in all programming languages is something like this:
collection.each{|item|
if item.test_1
item.action_1
elsif item.test_2
item.action_2
elsif item.test_3
item.action_3
else
item.otherwise
end
}
Or a very similar:
collection.each{|item|
case item
when pattern_1
item.action_1
when pattern_2
item.action_2
when pattern_3
item.action_3
else
item.otherwise
end
}
Tests and actions are all next to each other, where they belong. But what if instead of executing an action on a single item at a time, we wanted to do so on all matching items together?
If Ruby had proper macros it would be totally trivial - unfortunately Ruby forces us to choose one of bad options. First, the most straightforward:
yes1, no1 = collection.partition{|item| item.test_1}
yes2, no12 = no1.partition{|item| item.test_2}
yes3, no123 = no12.partition{|item| item.test_3}
yes1.action_1
yes2.action_2
yes3.action_3
no123.otherwise
Rather awful. Or perhaps this?
groups = collection.group_by{|item|
if item.test_1 then 1
elsif item.test_2 then 2
elsif item.test_3 then 3
else 4
end
}
(groups[1]||[]).action_1
(groups[2]||[]).action_2
(groups[3]||[]).action_3
(groups[4]||[]).otherwise
By the way we cannot use a series of selects here - action_3 should apply only to items which pass test_3 but not test_1 or test_2.
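The difference between independent selects and the cascade is easy to see with concrete data. A small sketch, with made-up tests on integers standing in for test_1, test_2 and test_3; the cascade helper is a hypothetical name for this example.

```ruby
numbers = (1..12).to_a
test_1 = proc { |n| n % 2 == 0 }  # even
test_2 = proc { |n| n % 3 == 0 }  # multiple of three
test_3 = proc { |n| n > 8 }

# Independent selects: an item may land in several groups.
by_select = [test_1, test_2, test_3].map { |t| numbers.select(&t) }

# Cascade: each test only sees what the earlier tests rejected,
# and whatever is left at the end is the "otherwise" group.
def cascade(items, tests)
  groups = []
  rest = tests.inject(items) do |remaining, test|
    yes, no = remaining.partition(&test)
    groups << yes
    no
  end
  groups << rest
end

by_cascade = cascade(numbers, [test_1, test_2, test_3])

p by_select
p by_cascade
```

With the selects, 6 and 12 show up under both the first and second test; with the cascade every number appears in exactly one group.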
We can imagine adding extra methods to Enumerable to get syntax like this:
collection.run_for_each_group(
proc{|item| item.test_1}, proc{|group| group.action_1},
proc{|item| item.test_2}, proc{|group| group.action_2},
proc{|item| item.test_3}, proc{|group| group.action_3},
proc{|group| group.otherwise})
Or maybe like this (looks even worse if you need to assign groups to a variable before performing the relevant action):
tmp = collection.dup
tmp.destructive_select!{|item| item.test_1}.action_1
tmp.destructive_select!{|item| item.test_2}.action_2
tmp.destructive_select!{|item| item.test_3}.action_3
tmp.otherwise
#destructive_select! being a method in style of Perl's splice - removing some items from collection, and returning removed values.
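A possible implementation of such a method is only a few lines. This is a sketch under the assumption that it lives on Array and is built on Array#partition; destructive_select! is the hypothetical method named above, not something Ruby ships.

```ruby
class Array
  # Remove every item for which the block is true and return the removed
  # items; the receiver keeps only the items that failed the test.
  def destructive_select!
    picked, kept = partition { |item| yield item }
    replace(kept)
    picked
  end
end

tmp = %w[apple kiwi banana fig cherry]
short_words = tmp.destructive_select! { |w| w.size <= 4 }
a_words     = tmp.destructive_select! { |w| w.include?('a') }

p short_words
p a_words
p tmp   # whatever no test claimed
```

Because each call shrinks tmp, later tests never see items an earlier test already claimed, which is exactly the cascade behaviour wanted here.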
Possibly wrapping it in something like:
collection.filter{|item| item.test_1}.action{|group| group.action_1}.
  filter{|item| item.test_2}.action{|group| group.action_2}.
  filter{|item| item.test_3}.action{|group| group.action_3}.
  action{|group| group.otherwise}
A few more bad ideas (David Allen says the way you can tell a highly creative person is that they generate bad ideas faster than anyone else). With instance_eval we could do something like this, with item and group being appropriate method calls.
collection.run_for_each_group{
rule{ item.test_1 }
action{ group.action_1 }
rule{ item.test_2 }
action{ group.action_2 }
rule{ item.test_3 }
action{ group.action_3 }
action{ group.otherwise }
}
It would be pretty hard to do that while still being able to have inner blocks with your current object's context. By the way trying this out I found out that it's impossible to call a block specifying self, and call a block passing arguments at the same time - it's only one or the other - and no combination of the two makes it work. Those tiny limitations are just infuriating.
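One side note on that limitation: Ruby 1.8.7 and 1.9 do provide Object#instance_exec, which runs a block with a chosen self and passes it arguments in the same call. Whether it covers the case the experiment ran into is not something this sketch can confirm; it only shows the call itself. The Rule struct and the block are invented for the example.

```ruby
# A stand-in object to act as self inside the block.
Rule = Struct.new(:name)
rule = Rule.new('test_1')

# The block uses both its arguments and a method of the object it runs in.
blk = proc { |item, group| "#{name}: #{item} of #{group.size}" }

# instance_exec sets self to `rule` AND passes the arguments through.
result = rule.instance_exec(:x, [:x, :y], &blk)
puts result
```

Note that changing self this way still hides the instance variables and private methods of the object where the block was written, so the concern about losing the current object's context remains.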
I also tried overriding ===. Now that would only work for a small subset of cases but was worth a try:
collection.run_for_each_group{|item, group|
case item
when pattern_1
group.action_1
when pattern_2
group.action_2
when pattern_3
group.action_3
else
group.otherwise
end
}
This item would actually be a special object, calling === on which would callcc, partition collection in two, and resume twice modifying group variable (initially set to the entire collection). That would be pretty cool - except Ruby doesn't use double dispatch, so === is not a CLOS style generic function - it's a method, set on pattern objects, and while adding new pattern types is easy, making old patterns match new kinds of objects is hard. It would require manually finding out every pattern, and manually overriding it to handle our magic item type - and then a lot of hackery to make Regexp#=== work, and then it would fail anyway, as Range#=== and such seem to be handled specially by Ruby.
There was a related possibility of not doing anything weird to item, but requiring special patterns:
collection.run_for_each_group{|item, group, all|
case item
when all[pattern_1]
group.action_1
when all[pattern_2]
group.action_2
when all[pattern_3]
group.action_3
else
group.otherwise
end
}
We're not actually using item here at all, so we don't really need to pass it:
collection.run_for_each_group{|group, all|
if all[pattern_1]
group.action_1
elsif all[pattern_2]
group.action_2
elsif all[pattern_3]
group.action_3
else
group.otherwise
end
}
Totally implementable, only somewhat ugly with all these all[]s. There are two good ways to implement it - all function would test all items, and if all returned the same value it would just return. Otherwise, it would divide the collection, and in one implementation use callcc, or in alternative implementation, throw something, and restart the whole block twice - this assumes tests are cheap and deterministic.
It looks good, but it doesn't make me happy, as I want all kinds of tests, not just pattern matches. And eventually I came up with this:
collection.run_for_each_group{|item, group, all|
if all[item.test_1]
group.action_1
elsif all[item.test_2]
group.action_2
elsif all[item.test_3]
group.action_3
else
group.otherwise
end
}
This way, you can do any test on item you want - just pass the result to all[] before proceeding.
How is it implemented? I could callcc for every element, but unlike Scheme's, Ruby's callcc is rather expensive. And not every version of Ruby has it. So it's the naive throw-and-restart-twice instead. This means tests on each item can be rerun many times, so they better be cheap. Determinism is also advised, even though my implementation caches the first value returned to avoid troubles.
Well, first some usage example you can actually run:
require "pathname"
files = Pathname("/etc").children
files.run_for_each_group{|x,xs,all|
if all[x.directory?]
puts "Subdirectories: #{xs*' '}"
elsif all[x.symlink?]
puts "Symlinks: #{xs*' '}"
elsif all[x.size > 2**16]
puts "Big files: #{xs*' '}"
else
puts "The rest: #{xs.size} files"
end
}
Doesn't it look a lot better than a long cascade of #partitions?
And now #run_for_each_group:
module Enumerable
  def run_for_each_group(expected=[], &blk)
    return if empty?
    xst, xsf = [], []
    each{|it|
      answers = expected.dup
      catch :item_tested do
        yield(it, self, proc{|v|
          if answers.empty?
            (v ? xst : xsf) << it
            throw :item_tested
          end
          answers.pop
        })
        return
      end
    }
    xst.run_for_each_group([true, *expected], &blk)
    xsf.run_for_each_group([false, *expected], &blk)
  end
end
It shouldn't be that difficult to understand. expected tracks the list of expected test results for all items in current collection. Now we iterate, passing each element, the entire group, and all callback function.
The first few times all is called, it just returns recorded answers - they're the same for every element. If after all recorded answers all is called again - we record its result, throw out of the block, and rerun it twice with expanded expectations.
On the other hand if we didn't get any calls to all other than those already recorded, it means we reached the action - group it sees is every element with the same test history. This must only happen once for group, so we return from function.
Total number of block calls is - 1x for each action, 2x for directories, 3x for symlinks, 4x for big files, and also 4x for everything else. Avoiding these reruns would be totally possible with callcc - but it's rather ugly, and often these tests aren't an issue.
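To watch the throw-and-restart mechanism on something smaller than /etc, here is the same idea run over six integers. It is a sketch written as a plain method taking the collection as an argument, so Enumerable is left untouched; that restructuring and the variable names are choices of this example, not the code above.

```ruby
# Same algorithm as above, as a standalone method.
# `expected` holds the answers already known for this group, newest first.
def run_for_each_group(items, expected = [], &blk)
  return if items.empty?
  passed, failed = [], []
  items.each do |item|
    answers = expected.dup
    catch :item_tested do
      all = proc do |v|
        if answers.empty?            # a test we have not recorded yet
          (v ? passed : failed) << item
          throw :item_tested         # abandon this run of the block
        end
        answers.pop                  # replay the oldest recorded answer
      end
      blk.call(item, items, all)
      return                         # block reached an action: group is done
    end
  end
  run_for_each_group(passed, [true, *expected], &blk)
  run_for_each_group(failed, [false, *expected], &blk)
end

seen = {}
calls = 0
run_for_each_group((1..6).to_a) do |n, group, all|
  calls += 1
  if all[n.even?]
    seen[:even] = group
  elsif all[n > 3]
    seen[:big_odd] = group
  else
    seen[:rest] = group
  end
end

p seen
puts "block was called #{calls} times"
```

The call counter makes the cost of the reruns visible: items that fall through more tests cause the block to be entered more often before their group finally reaches its action.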
So problem solved? Not really. I keep finding myself in situations where a new control structure would make a big difference, and there just doesn't seem to be any way of making it work in Ruby without enough boilerplate code to make it not worthwhile.
I'll end this post with some snippets of code which are just not quite right. Any ideas for making them suck less?
urls = Hash[file.map{|line| id, url = line.split; [id.to_i, url]}]
each_event{|type, *args|
case type
when :foo
one, two = *args
# ...
when :bar
one, = *args
# ...
end
}
if dir
Dir.chdir(dir){ yield(x) }
else
yield(x)
end
18 Jul 2010 8:22am GMT
17 Jul 2010
Planet Ruby
Tomasz Wegrzanowski: Another example of Ruby being awesome - %W
And there I was thinking I knew everything about Ruby, at least as far as its syntax goes...
As you might have figured out from my previous posts, I'm totally obsessed about string escaping hygiene - I would never send "SELECT * FROM reasons_why_mysql_sucks WHERE reason_id = #{id}" to an sql server even if I was absolutely totally certain that id is a valid integer and nothing can possibly go wrong here. Sure, I might be right 99% of time, but it only takes a single such mistake to screw up the system. And not only with SQL - it's the same with generated HTML, generated shell commands and so on.
And speaking of shell commands - system function accepts either a string which it then evaluates according to shell rules (big red flag), or a list of arguments which it uses to fork+exec right away. Of course we want to do that - except it's really goddamn ugly. Faced with a choice between this insecure but reasonably looking way of starting MongoDB shard servers:
system "mongod --shardsvr --port '#{port}' --fork --dbpath '#{data_dir}' \
--logappend --logpath '#{logpath}' --directoryperdb"
And this secure but godawful (to_s is necessary as port is an integer, and system won't take that):
system *["mongod", "--shardsvr", "--port", port, "--fork",
"--dbpath", data_dir, "--logappend",
"--logpath", logpath, "--directoryperdb"].map(&:to_s)
Even I have my doubts.
And then I found something really cool in Ruby syntax that totally solves the problem. Now I was totally aware of the %w[foo bar] syntax Ruby copied from Perl's qw[foo bar], which, while occasionally useful, is really little more than constructing a string and then calling #split on it.
And I thought I was also aware of %W - which obviously would work just like %w except evaluating code inside. Except that's not what it does! %W[foo #{bar}] is not "foo #{bar}".split - it's ["foo", "#{bar}"]! And using a real parser of course, so you can use as many spaces inside that code block as you want.
system *%W[mongod --shardsvr --port #{port} --fork --dbpath #{data_dir}
--logappend --logpath #{logpath} --directoryperdb]
There's nothing in Perl able to do that. Not only is it totally secure, it looks even better than the original insecure version, as you don't need to insert all those 's around arguments (which only half-protected them anyway, but were better than nothing), and you can break it into multiple lines without \s.
%W always does the right thing - %W[scp #{local_path} #{user}@#{host}:#{remote_path}] will keep the whole remote address together - and if the code block returns an empty string or nil, you'll get an empty string there in the resulting array. I sort of wish there was some way of adding extra arguments with *args-like syntax like in other contexts, but %W[...] + args does exactly that, so it's not a big deal.
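The behaviour is easy to check in irb. A short sketch with made-up values (the port, paths, user and host below are illustrative only), comparing %W against the naive interpolate-then-split approach:

```ruby
port     = 27018
data_dir = '/var/lib/mongo data'   # note the embedded space
user, host, remote_path = 'deploy', 'db1', '/tmp/dump'
nothing  = nil

# %W splits on the literal whitespace first, then interpolates each word.
args = %W[mongod --port #{port} --dbpath #{data_dir} #{nothing}]

# Interpolating first and splitting afterwards breaks the path in two.
naive = "mongod --port #{port} --dbpath #{data_dir}".split

# Adjacent interpolations stay glued together as one argument.
scp = %W[scp dump.tgz #{user}@#{host}:#{remote_path}]

p args
p naive
p scp
```

Every element of args is already a String, including the interpolated integer, so no .map(&:to_s) is needed before handing the array to system.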
By the way, it seems to me that all % constructors undeservingly get a really bad reputation as some sort of ugly Perl leftover in Ruby community. This is so wrong - what's ugly is excessive escaping with \ which they help avoid. Which regexp for Ruby executables looks less bad, the one with way too many \/s - /\A(\/usr|\/usr\/local|\/opt|)\/bin\/j?ruby[\d.]*\z/, or one which avoids them all thanks to %r - %r[\A(/usr|/usr/local|/opt|)/bin/j?ruby[\d.]*\z]?
By the way - yes I used []s inside even though they were the big demarcator. That's another great beauty of % constructions - if you demarcate with some sort of braces like [], (), <>, or {} - it will only close once every matched pair inside is closed - so unlike traditional singly and doubly quoted strings % can be nested infinitely deep without a single escape character! (Perl could do that one as well)
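Both claims can be checked directly: the two regexps accept the same paths, and bracket delimiters nest without escaping. A quick sketch; the sample paths are made up for the example.

```ruby
escaped = /\A(\/usr|\/usr\/local|\/opt|)\/bin\/j?ruby[\d.]*\z/
percent = %r[\A(/usr|/usr/local|/opt|)/bin/j?ruby[\d.]*\z]

paths = %w[
  /usr/bin/ruby
  /usr/local/bin/jruby
  /bin/ruby1.8.7
  /opt/bin/ruby1.9
  /home/me/bin/ruby
  /usr/bin/rubygems
]

# Record the verdict of each regexp for every path.
results = paths.map do |path|
  [path, !!(path =~ escaped), !!(path =~ percent)]
end

results.each { |path, _, ok| puts "#{path}: #{ok ? 'match' : 'no match'}" }

# Matched pairs of the delimiter may nest to any depth.
nested = %q[outer [inner [deepest]] done]
puts nested
```

The character class [\d.] inside %r[...] is itself a matched pair of brackets, which is why it does not end the literal early.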
And speaking of things that Ruby copied from Perl, and then made them much more awesome, here's a one-liner to truncate a bunch of files after 10 lines, with optional backups. Which language gets even close to matching that? ($. in both Perl and Ruby will keep increasing from file to file, so you cannot use that)
ruby -i.bak -ple 'ARGF.skip if ARGF.file.lineno > 10' files*.txt
17 Jul 2010 9:22pm GMT
16 Jul 2010
Planet Ruby
Nick Sieger: JRuby and Rails 3, Sitting in a Tree
Synopsis
jruby -S rails new myapp -m http://jruby.org/rails3.rb
When creating your Rails 3 application, just add the JRuby-specific template (-m http://jruby.org/rails3.rb).
Details
$ jruby -S gem install rails --pre --no-rdoc --no-ri
Due to a rubygems bug, you must uninstall all older versions of bundler for 0.9 to work
Successfully installed i18n-0.3.3
Successfully installed tzinfo-0.3.16
Successfully installed builder-2.1.2
Successfully installed memcache-client-1.7.8
Successfully installed activesupport-3.0.0.beta
Successfully installed activemodel-3.0.0.beta
Successfully installed rack-1.1.0
Successfully installed rack-test-0.5.3
Successfully installed rack-mount-0.4.7
Successfully installed abstract-1.0.0
Successfully installed erubis-2.6.5
Successfully installed actionpack-3.0.0.beta
Successfully installed arel-0.2.1
Successfully installed activerecord-3.0.0.beta
Successfully installed activeresource-3.0.0.beta
Successfully installed mime-types-1.16
Successfully installed mail-2.1.3
Successfully installed text-hyphen-1.0.0
Successfully installed text-format-1.0.0
Successfully installed actionmailer-3.0.0.beta
Successfully installed thor-0.13.3
Successfully installed railties-3.0.0.beta
Successfully installed bundler-0.9.7
Successfully installed rails-3.0.0.beta
24 gems installed
And:
$ jruby -S gem install activerecord-jdbcsqlite3-adapter --no-rdoc --no-ri
Successfully installed activerecord-jdbc-adapter-0.9.3-java
Successfully installed jdbc-sqlite3-3.6.3.054
Successfully installed activerecord-jdbcsqlite3-adapter-0.9.3-java
3 gems installed
Finally:
$ jruby -S rails new myapp -m http://jruby.org/rails3.rb
create
...(app creation)...
apply http://jruby.org/rails3.rb
apply http://jruby.org/templates/default.rb
gsub Gemfile
run jruby script/rails generate jdbc from "."
...(warnings omitted)...
exist
create config/initializers/jdbc.rb
create lib/tasks/jdbc.rake
$ cd myapp
$ jruby script/rails server
...(warnings omitted)...
=> Booting WEBrick
=> Rails 3.0.0.beta application starting in development on http://0.0.0.0:3000
=> Call with -d to detach
=> Ctrl-C to shutdown server
[2010-02-23 19:44:26] INFO WEBrick 1.3.1
[2010-02-23 19:44:26] INFO ruby 1.8.7 (2010-02-23) [java]
[2010-02-23 19:44:26] INFO WEBrick::HTTPServer#start: pid=16449 port=3000

Recap
You'll have best results with JRuby 1.5 snapshots, which include RubyGems 1.3.6. JRuby 1.5 final is coming soon. Also, the new activerecord-jdbc-adapter 0.9.3 release is required for Rails 3 compatibility.
The Rails experience on JRuby continues to get better.
16 Jul 2010 6:02pm GMT
Pat Eyler: Lone Star Ruby Conf Speaker Interview: Jesse Wolgamott
Today's a twofer for the Lone Star Ruby Conference. My third interview (second today) is with Jesse Wolgamott (@jwo) who's presenting "Battle of NoSQL stars: Amazon's SDB vs Mongoid vs CouchDB vs RavenDB ". Jesse shares some thoughts about NoSQL and the conference. NoSQL looks like it's gaining momentum. Why should Rubyists be interested in the topic? Jesse Once you reach the point in
16 Jul 2010 2:39pm GMT
Pat Eyler: Lone Star Ruby Conf Speaker Interview: Nephi Johnson
Okay, time for a second interview with a Lone Star Ruby Conference speaker. This time, Nephi Johnson (@d0c_s4vage) talks a bit about his presentation - "Less-Dumb Fuzzing and Ruby Metaprogramming". Fuzzing isn't always well understood. Can you describe fuzzing, and tell us what situations it's a good fit for? Nephi Fuzzing is a term used to describe the process of feeding an application
16 Jul 2010 12:04pm GMT
Tomasz Wegrzanowski: Arrays are not integer-indexed Hashes
We use a separate Array type even though Ruby Hashes can be indexed by integers perfectly well (unlike Perl hashes which implicitly convert all hash keys to strings, and array keys to integers). Hypothetically, we could get rid of them altogether and treat ["foo", "bar"] as syntactic sugar for {0=>"foo", 1=>"bar"}.
Now there are obviously some performance reasons for this - these are mostly fixable and a single data structure can perform well in both roles. And it would break backwards compatibility rather drastically, but let's ignore all that and imagine we're designing a completely fresh language which simply looks a lot like Ruby.
What would work
First, a lot of things work right away like [], []=, ==, size, clear, replace, and zip.
The first incompatibility is with each - for hashes it yields both keys and values, for arrays only values, and we'd need to decide one way or the other - I think yielding both makes more sense, but then there are all those non-indexable enumerables which won't be able to follow this change, so there are good reasons to only yield values as well. In any case, each_pair, each_key, and each_value would be available.
Either way, one more change would be necessary here - each and everything else would need to yield elements sorted by key. There are performance implications, but they're not so bad, and it would be a nicer API.
Hash's methods keys, values, invert, and update all make perfect sense for Arrays. With keys sorted, first, last, and pop would work quite well. push/<< would be slightly nontrivial - but making it add #succ of the last key (or 0 for empty hashes) would work well enough.
Collection tests like any?, all?, one?, none? are obvious once we decide each, and so is count. map/collect adapts to hashes well enough (yielding both key and value, and returning new value).
Array methods like shuffle, sort, sample, uniq, and flatten which ignore indexes (but not their relative positions) would do likewise for hashes, so flattening {"a"=>[10,20], "b"=>30} would result in [10,20,30] ("a" yields before "b").
Enumerable methods like min/max/min_by/max_by, find, find_index, inject would do likewise.
include? checks values for Arrays and keys for hashes - we can throw that one out (or decide one way or the other, values make more sense to me), and use has_key?/has_value? when it matters.
reverse should just return values, but reverse_each should yield real keys.
I could go on like this. My point is - a lot of this stuff can be made to work really well. Usually there's a single behavior sensible for both Arrays, and Hashes, and if you really need something different then keys, values, or pairs would usually be a suitable solution.
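To make the each and include? differences concrete, here is a small sketch of how stock Ruby's Array and Hash disagree today. The helper name yielded_items is made up for illustration and is not part of Ruby.

```ruby
# Collect whatever #each yields, to compare Array and Hash behaviour.
# (yielded_items is a hypothetical helper, not part of Ruby.)
def yielded_items(collection)
  seen = []
  collection.each { |item| seen << item }
  seen
end

list  = ["foo", "bar"]
table = {0 => "foo", 1 => "bar"}

p yielded_items(list)      # Array#each yields values only
p yielded_items(table)     # Hash#each yields [key, value] pairs
p list.include?("foo")     # Array#include? looks at values
p table.include?("foo")    # Hash#include? looks at keys
p table.has_value?("foo")  # the explicit way to ask a Hash about values
```

A unified type would have to pick one of the two behaviours for each of these methods.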
What doesn't work
Unfortunately some things cannot be made to work. Consider this - what should be the return value of {0 => "zero", 1 => "one"}.select{|k,v| v == "one"}?
If we treat it as a hash - let's say a mapping of numbers to their English names, there is only one correct answer, and everything else is completely wrong - {1=>"one"}.
On the other hand if we treat it as an array - just an ordered list of words - there is also only one correct answer, and everything else is completely wrong - {0=>"one"}.
These two are of course totally incompatible. And an identical problem affects a lot of essential methods. Deleting an element renumbers items for an array, but not for a hash. shift/unshift/drop/insert/slice make no sense for hashes, and methods like group_by and partition have two valid and conflicting interpretations. It is, pretty much, unfixable.
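The conflict can be seen with the two existing types. This is only an illustration, and it assumes Ruby 1.9 or later, where Hash#select returns a Hash (on 1.8 it returned an array of pairs).

```ruby
table = {0 => "zero", 1 => "one"}
list  = ["zero", "one"]

kept_pairs = table.select { |key, value| value == "one" }
kept_items = list.select { |value| value == "one" }

p kept_pairs               # hash reading: key 1 stays attached to "one"
p kept_items.index("one")  # array reading: "one" has moved to position 0

list.delete_at(0)  # deleting from an array renumbers what is left
table.delete(0)    # deleting from a hash leaves the other keys alone
p list[0]          # "one" is now at index 0
p table[1]         # "one" is still under key 1
```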
So what went wrong? Thinking that Arrays are indexed by integers was wrong!
In {0=>"zero",1=>"one"} association between keys and values is extremely strong - key 0 is associated with value "zero", and key 1 with value "one". They exist as a pair and everything that happens to the hash happens to pairs, not to keys or values separately - there are no operations like insert_value, delete_value which would just shift remaining values around from one key to another. This is the nature of hashes.
Arrays are not at all like that. In ["zero", "one"] association between 0 and "zero" is very weak. The real keys are not 0, and 1 - they're two objects devoid of any external meaning, whose only property is their relative partial order.
To implement array semantics on top of hashes, we need a class like Index.new(greater_than=nil, less_than=nil). Then a construction like this would have the semantics we desire.
arr = {}
arr[Index.new(arr.last_key, nil)] = "zero"
arr[Index.new(arr.last_key, nil)] = "one"
If we use these instead of integers, hashes can perform all array operations correctly.
# shift
arr.delete(arr.first_key)
# unshift
arr[Index.new(nil, arr.first_key)] = "minus one"
# select - indexes for "zero" and "two" in result have correct order
["zero", "one", "two"].select{|key, value| value != "one"}
# insert - nth_key only needs each
arr[Index.new(arr.nth_key(0), arr.nth_key(1))] = "one and half"
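For the curious, one way such an Index could be made to run is sketched below. Everything here is an assumption for illustration: the text only requires that keys have a relative order, and this sketch gets it by giving each Index a hidden Rational position, which is a swapped-in technique and only one of many possible ones. OrderedHash and ordered_values are hypothetical names; first_key, last_key and nth_key follow the snippets above.

```ruby
# An index whose only meaningful property is its order relative to others.
# The hidden Rational position is an implementation assumption.
class Index
  include Comparable
  attr_reader :pos

  def initialize(greater_than = nil, less_than = nil)
    lo = greater_than && greater_than.pos
    hi = less_than && less_than.pos
    @pos =
      if lo && hi then (lo + hi) / 2   # between two existing keys
      elsif lo    then lo + 1          # after the last key
      elsif hi    then hi - 1          # before the first key
      else             Rational(0)     # very first key
      end
  end

  def <=>(other)
    pos <=> other.pos
  end
end

# Hypothetical Hash subclass providing the key helpers used in the text.
class OrderedHash < Hash
  def sorted_keys
    keys.sort
  end

  def first_key
    sorted_keys.first
  end

  def last_key
    sorted_keys.last
  end

  def nth_key(n)
    sorted_keys[n]
  end

  def ordered_values
    sorted_keys.map { |key| self[key] }
  end
end

arr = OrderedHash.new
arr[Index.new(arr.last_key, nil)] = "zero"
arr[Index.new(arr.last_key, nil)] = "one"
arr[Index.new(arr.nth_key(0), arr.nth_key(1))] = "one half"
p arr.ordered_values
```

Note that nth_key sorts all keys on every call, so this is nowhere near practical, which matches the point being made.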
And so the theory is satisfied. We have a working solution, even if a highly impractical one. Of course all these Index objects are rather hard to use, so the first thing we'd do is subclass Hash so that arr[i] would really mean arr[arr.nth_key(i)] and so on, and there's really no point yielding them in #each and friends... oh wait, that's exactly where we started.
In other words, unification of arrays and hashes is impossible - at least unless you're willing to accept a monstrosity like PHP where numerical and non-numerical indexes are treated differently, and half of array functions accept a boolean flag asking if you'd rather have it behave like an array or like a hash.
16 Jul 2010 8:44am GMT
Tomasz Wegrzanowski: Random sampling or processing data streams in Ruby
It might sound like I'm tackling a long solved problem here - sort_by{rand}[0, n] is a well known idiom, and in more recent versions of Ruby you can use even simpler shuffle[0, n] or sample(n).
They all suffer from two problems. The minor one is that quite often I want elements in the sample to be in the same relative order as in the original collection (this in no way implies sorted) - which can be dealt with by a Schwartzian transform to [index, item] space, sampling that, sorting results, and transforming out to just item.
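A minimal sketch of that transform, assuming an in-memory collection and Ruby 1.9+ for Array#sample. The name ordered_sample is made up, and each_with_index produces [item, index] pairs rather than [index, item], which works just as well.

```ruby
# Hypothetical helper: sample n items but keep their original relative order.
def ordered_sample(collection, n)
  pairs = collection.each_with_index.to_a        # into [item, index] space
  picked = pairs.sample(n)                        # random subset, arbitrary order
  picked.sort_by { |_item, index| index }         # restore original order
        .map { |item, _index| item }              # back to plain items
end

p ordered_sample(%w[d a c b f e], 3)
```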
The major problem is far worse - for any of these to work, the entire collection must be loaded to memory, and if that was possible, why even bother with random sampling? More often than not, the collection I'm interested in sampling is something disk-based that I can iterate only once with #each (or twice if I really really have to), and I'm lucky if I even know its #size in advance.
By the way - this is totally unrelated, but I really hate #length method with passion - collections have sizes, not "lengths" - for a few kinds of collections we can imagine them arranged in a neat ordered line, and so their size is also length, but it's really lame to name a method after special case instead of far more general "size" - hashtables have sizes not lengths, sets have sizes not lengths, and so on - #length should die in fire!
When size is known
So we have a collection we can only iterate once - for now let's assume we're really lucky and we know exactly how many elements it has - this isn't all that common, but it happens every now and then. As we want n elements out of size, probability of each element being included is n/size, and so select{ n > rand(size) } will nearly do the trick - even keeping samples in the right order... except it will only return approximately n elements.
If we're sampling 1000 out of a billion we might not really care all that much, but it turns out it's not so difficult to do better than that. Sampling n elements out of [first, *rest] collection neatly reduces to: [first, *rest.sample(n-1)] with n/size probability, or rest.sample(n) otherwise. Except Ruby doesn't have decent tail-call optimization, so we'll use counters for it.
module Enumerable
  def random_sample_known_size(wanted, remaining=size)
    if block_given?
      each{|it|
        if wanted > rand(remaining)
          yield(it)
          wanted -= 1
        end
        remaining -= 1
      }
    else
      rv = []
      random_sample_known_size(wanted, remaining){|it| rv.push(it) }
      rv
    end
  end
end
This way of sampling has an extra feature that it can yield samples one at a time and never needs to store any in memory - something you might appreciate if you want to take a couple million elements out of 10 billion or so, and you will not only avoid loading them to memory, you will be able to use the results immediately, instead of only when the entire input finishes.
This is only possible if collection size is known - if we don't know if there's 1 element ahead or 100 billion, there's really no way of deciding what to put in the sample.
If you cannot fit even the sample in memory at once, and don't know collection size in advance - it might be the easiest thing to iterate twice, first to compute the size, and then to yield random records one at a time (assuming collection size doesn't change between iterations at least). CPU and sequential I/O are cheap, memory and random I/O are expensive.
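For comparison, the recursive reduction that the counter-based version replaces could be written roughly like this. It is only a sketch on an in-memory Array (so it defeats the streaming purpose and will blow the stack on large inputs), and recursive_sample is a made-up name.

```ruby
# Take n of items: keep the first element with probability n/size and
# recurse on the rest with n-1, otherwise recurse on the rest with n.
def recursive_sample(items, n)
  return [] if n <= 0 || items.empty?
  first, *rest = items
  if rand(items.size) < n
    [first, *recursive_sample(rest, n - 1)]
  else
    recursive_sample(rest, n)
  end
end

p recursive_sample((1..20).to_a, 5)
```

Once the number of remaining elements drops to n, rand(items.size) < n is always true, so the result always has exactly n elements (or all of them if there are fewer than n).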
When size is unknown
Usually we don't know collection size in advance, so we need to keep a running sample - initialize it with the first n elements, and then for each element that arrives replace a random one from the sample with probability n / size_so_far.
The first idea would be something like this:
module Enumerable
  def random_sample(wanted)
    rv = []
    size_so_far = 0
    each{|it|
      size_so_far += 1
      j = rand(size_so_far)
      rv.delete_at(j) if wanted == rv.size and wanted > j
      rv.push(it) if wanted > rv.size
    }
    rv
  end
end
It suffers from a rather annoying performance problem - we're keeping the sample in a Ruby Array, and while they're optimized for adding and removing elements at both ends, deleting something from the middle is an O(size) memmove.
We could replace rv.delete_at(j); rv.push(it) with rv[j] = it to gain performance at the cost of item order in the sample... or we could do that plus a Schwartzian transform into [index, item] space to get correctly ordered results fast. This only matters once sample size reaches tens of thousands, before that brute memmove is simply faster than evaluating extra Ruby code.
module Enumerable
  def random_sample(wanted)
    rv = []
    size_so_far = 0
    each{|it|
      size_so_far += 1
      j = wanted > rv.size ? rv.size : rand(size_so_far)
      rv[j] = [size_so_far, it] if wanted > j
    }
    rv.sort.map{|idx, it| it}
  end
end
This isn't what stream processing looks like!
The algorithms are as good as they'll get, but the API is really not what we want. When we actually do have an iterate-once collection, we usually want to do more than just collect a sample. So let's encapsulate such a continuously updated sample into a Sample class:
class Sample
  def initialize(wanted)
    @wanted = wanted
    @size_so_far = 0
    @sample = []
  end
  def add(it)
    @size_so_far += 1
    j = @wanted > @sample.size ? @sample.size : rand(@size_so_far)
    @sample[j] = [@size_so_far, it] if @wanted > j
  end
  def each
    @sample.sort.each{|idx, it| yield(it)}
  end
  def total_size
    @size_so_far
  end
  include Enumerable
end
It's a fully-featured Enumerable, so it should be really easy to use. #total_size will return count of all elements seen so far - calling that #size would conflict with the usual meaning of number of times #each yields. You can even nondestructively access the sample, and then keep updating it - usually you wouldn't want that, but it might be useful for scripts that run forever and periodically save partial results.
To see how it can be used, here's a very simple script, which reads a possibly extremely long list of URLs, and prints a sample of 3 by host. By the way notice autovivification of Samples inside the Hash - it's a really useful trick, and Ruby's autovivification can do a lot more than Perl's.
require "uri"
sites = Hash.new{|ht,k| ht[k] = Sample.new(3)}
STDIN.each{|url|
  url.chomp!
  host = URI.parse(url).host rescue next
  sites[host].add(url)
}
sites.sort.each{|host, url_sample|
  puts "#{host} - #{url_sample.total_size}:"
  url_sample.each{|u| puts "* #{u}"}
}
So enjoy your massive data streams.
16 Jul 2010 4:23am GMT
15 Jul 2010
Planet Ruby
Pat Eyler: LSRC Speaker Interview with David Copeland
With the Lone Star Ruby Conference just over a month away, I thought it would be a good idea to talk to some of the presenters. David Copeland (@davetron5000) is giving a talk about a topic that resonated with me, so I sent off an email to find out more about what he thought would make his presentation and the conference worthwhile. I've never been a big 'web app' kind of guy, so I was excited
15 Jul 2010 3:19pm GMT
Pat Eyler: Ruby|Web Interview
Mike Moore, one of the big movers behind MountainWest RubyConf and the UtahValley.rb is getting the ball moving for another Ruby-centric conference -Ruby|Web. He was kind enough to sit down with me and share his thoughts about Regional Ruby Conferences and how Ruby|Web fits into that space. SLC already has MWRC, why another Ruby (ok, Ruby+) conference? Mike Moore There are so many great
15 Jul 2010 10:08am GMT
Tomasz Wegrzanowski: Synchronized compressed logging the Unix way
In good Unix tradition if a program generates some data, in general it should write it to STDOUT, and you'll redirect it to the right file yourself.
There are two problems with that, both easily solvable in separation:
- If it's a lot of data, you want to store it compressed. It would be bad Unix to put compression directly in the program - the right way is to pipe its output through gzip with program | gzip >logfile.gz. gzip is really fast, and usually adequate.
- You want to be able to see what were the last lines written out by the program at any time. Especially if it appears frozen. Sounds trivial, but thanks to a horrible misdesign of libc, and everything else based on it, data you write gets buffered before being actually written - a totally reasonable thing - and there are no limits whatsoever how long it can stay in buffers! Fortunately it is possible to turn this misfeature off with a single line of STDOUT.sync=true or equivalent in other languages.
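The effect of the sync flag can be observed with a pipe. This is a small sketch rather than anything from the original program: visible_immediately? is a made-up helper, and it sets sync explicitly on the pipe's write end so that both behaviours can be compared.

```ruby
# Write one line into a pipe and check whether the reading side can see it
# right away, without the writer flushing or closing.
# (visible_immediately? is a hypothetical helper for illustration.)
def visible_immediately?(sync)
  reader, writer = IO.pipe
  writer.sync = sync
  writer.write("progress line\n")
  # With exception: false this returns :wait_readable when nothing has arrived.
  got = reader.read_nonblock(100, exception: false)
  got == "progress line\n"
ensure
  writer.close if writer
  reader.close if reader
end

p visible_immediately?(true)   # written straight through
p visible_immediately?(false)  # still sitting in Ruby's write buffer
```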
Unfortunately while both fixes involve a single line of obvious code - there's no easy way to solve them together. Even if you flushed all data from the program to gzip, gzip can hold onto it indefinitely. Now unlike libc which is simply broken, gzip has a good reason - compression doesn't work on one byte at a time - it takes a big chunk, compresses it, and only then writes it all out.
Still, even if it has good reasons not to flush data as soon as possible, it can and very much should flush it every now and then - with flushing every few seconds the reduction in compression ratio will be insignificant, and it will be possible to find out why the program froze almost right away. The underlying zlib library totally has this feature - unfortunately the command line gzip utility doesn't expose it.
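Ruby's zlib binding exposes that flush as Zlib::GzipWriter#flush. The sketch below, which uses an in-memory StringIO as a stand-in for the log file, shows that after a flush the bytes written so far can already be decompressed even though the gzip stream is not finished. readable_so_far is a made-up helper.

```ruby
require 'zlib'
require 'stringio'

# Decompress whatever part of a gzip stream has been written so far.
# (readable_so_far is a hypothetical helper; MAX_WBITS + 16 selects gzip framing.)
def readable_so_far(compressed_bytes)
  inflater = Zlib::Inflate.new(Zlib::MAX_WBITS + 16)
  inflater.inflate(compressed_bytes)
ensure
  inflater.close if inflater
end

sink = StringIO.new("".b)       # stands in for logfile.gz
gz = Zlib::GzipWriter.new(sink)
gz.write("first log line\n")
gz.flush                        # push compressed data out without ending the stream
p readable_so_far(sink.string)  # the line is already recoverable
gz.close
```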
So I wrote this:
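#!/usr/bin/env ruby
require 'thread'
require 'zlib'
def gzip_stream(io_in, io_out, flush_freq)
  fh = Zlib::GzipWriter.wrap(io_out)
  lock = Mutex.new
  # Background thread: flush compressed data every flush_freq seconds
  Thread.new{
    while true
      lock.synchronize{
        # return cannot jump out of a thread, so end the thread explicitly
        Thread.exit if fh.closed?
        fh.flush if fh.pos > 0
      }
      sleep flush_freq
    end
  }
  io_in.each{|line|
    lock.synchronize{
      fh.print(line)
    }
  }
  # Close under the lock so the flusher never sees a half-closed stream
  lock.synchronize{ fh.close }
end
gzip_stream(STDIN, STDOUT, 5)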
It reads lines on stdin, writes them to stdout, and flushes every 5 seconds (or whatever you configure) in a separate Ruby thread. Ruby green threads are little more than a wrapper over select() in case you're wondering. The check that fh.pos is non-zero is required as flushing before you write something seems to result in invalid output.
Now you can program | gzip_stream >logfile.gz without worrying about data getting stuck on the way (if you flush in your program that is).
15 Jul 2010 1:40am GMT








