Using Ruby 1.9 Ripper
While Ripper parses your code it continuously fires events (or “calls callbacks”) when it finds something interesting. There are two types of events: scanner (lexer) and parser events.
The scanner basically goes through the code from left to right, character by character. When it finds known things (such as a keyword, whitespace or a semicolon) it fires a corresponding event that you can react to. The parser works on a higher level and watches for known Ruby constructs (such as a symbol, a method call or a class definition) and also fires events.
You can check the available events by outputting Ripper::SCANNER_EVENTS and Ripper::PARSER_EVENTS.
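As a quick sketch of how to explore these lists (their exact contents vary between Ruby versions, so no output is shown here), the following prints how many events of each kind exist plus a small sample. It also looks at the two event tables, which map an event name to the number of arguments its callback receives.

```ruby
require 'ripper'

# Both constants are plain arrays of symbols; scanner events are listed
# here without the leading "@".
puts "scanner events: #{Ripper::SCANNER_EVENTS.size}"
puts "parser events:  #{Ripper::PARSER_EVENTS.size}"

p Ripper::SCANNER_EVENTS.sort.first(5)
p Ripper::PARSER_EVENTS.sort.first(5)

# The event tables map an event name to the arity of its callback.
p Ripper::PARSER_EVENT_TABLE[:binary]
p Ripper::SCANNER_EVENT_TABLE[:int]
```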
You can respond to these events by simply defining methods named :"on_#{event_name}" (omitting the @ character for scanner events). As long as your callbacks keep returning their results (you are free to change that, which you might well want to do), the parser always passes the results of the last inner parser events on to the current parser event. E.g.:
require 'ripper'
class DemoBuilder < Ripper::SexpBuilder
  def on_int(token) # scanner event
    super.tap { |result| p result }
  end
  def on_binary(left, operator, right) # parser event
    super.tap { |result| p result }
  end
end
src = "1 + 1"
DemoBuilder.new(src).parse
This outputs:
[:@int, "1", [1, 0]]
[:@int, "1", [1, 4]]
[:binary, [:@int, "1", [1, 0]], :+, [:@int, "1", [1, 4]]]
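Because return values travel outwards, a callback does not have to return an s-expression at all. The following is a minimal sketch (the class name TinyCalc is made up for this illustration) that subclasses plain Ripper instead of Ripper::SexpBuilder and returns numbers, so the outermost event ends up with the computed value. It only handles integer literals and binary operators.

```ruby
require 'ripper'

# Hypothetical example class, not part of Ripper.
class TinyCalc < Ripper
  # Scanner event: turn the token text into an Integer.
  def on_int(token)
    token.to_i
  end

  # Parser event: left and right are whatever the inner events returned.
  def on_binary(left, operator, right)
    left.public_send(operator, right)
  end

  # Collect statement results so the last one can be handed to on_program.
  def on_stmts_new
    []
  end

  def on_stmts_add(statements, statement)
    statements << statement
  end

  # The return value of on_program becomes the return value of #parse.
  def on_program(statements)
    statements.last
  end
end

p TinyCalc.new("1 + 2 * 3").parse # => 7
```

Operator precedence is already resolved by the parser here: the inner on_binary for 2 * 3 fires first and its result arrives as the right argument of the outer one.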
When a scanner event is fired you can check the current position (Ripper::SexpBuilder includes it in the result of the scanner event, and you can also always call lineno and column yourself), which allows for tracking detailed positioning information. Positions are given as [row, column], with the row being 1-based and the column 0-based. On parser-level events the current position is not very useful (and not passed to your event callbacks) because parser events are fired when the parser recognizes a known Ruby construct as completed, i.e. at the end of the construct.
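A minimal sketch of position tracking inside a scanner callback, assuming Ripper's lineno and column methods for reading the current position (IdentLocator is a made-up name for this example):

```ruby
require 'ripper'

# Hypothetical example class: records where every identifier token starts.
class IdentLocator < Ripper
  def positions
    @positions ||= []
  end

  def on_ident(token)
    # Inside a scanner callback, lineno and column describe the start of
    # the token that was just found.
    positions << [token, [lineno, column]]
    token
  end
end

locator = IdentLocator.new("foo = 1\n  bar = foo\n")
locator.parse
p locator.positions
# => [["foo", [1, 0]], ["bar", [2, 2]], ["foo", [2, 8]]]
```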
Scanner events are fired “just so”, i.e. the scanner finds something and calls your callback method. The return values might or might not be passed to parser events. Parser events, on the other hand, build a meaningful tree and their return values are always passed to the next (outer) event. You can generally think of events being fired “from the inside out”, starting with low-level scanner events.
You can examine the hierarchy of these events by doing:
require "ripper"
require "pp"
src = "1 + 1"
pp Ripper::SexpBuilder.new(src).parse
This will output:
[:program,
 [:stmts_add,
  [:stmts_new],
  [:binary, [:@int, "1", [1, 0]], :+, [:@int, "1", [1, 4]]]]]
You can think of this as a nested method call where the first element of each array is the method name and the rest are the arguments. In the example above there would be 6 method calls. The first :@int call would receive the arguments "1" and [1, 0], and :binary would receive [:@int, "1", [1, 0]], :+, [:@int, "1", [1, 4]] (the return values of the two :@int calls plus the operator). Only :stmts_new would not receive any arguments; :stmts_add and :program receive the results of the calls nested inside them.
When executed, the (theoretical) interpreter would first evaluate the innermost arguments, right? That’s exactly what Ripper does, too. It will first fire the first @int event, then the second one and then pass the return values of these two events (together with the :+ operator token) to the next outer method, which is the :binary event in this case.
(“Theoretical” of course refers to these particular s-expressions. There are languages that are very much based on exactly this concept, such as Lisp.)
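To make the nested-call reading concrete, here is a small sketch that prints an s-expression the way it would look as nested method calls. The helper name as_call is made up for this example and is not part of Ripper.

```ruby
require 'ripper'

# Render an s-expression as the nested method call it resembles.
# Arrays that start with a Symbol are treated as calls; everything else
# (strings, symbols, positions such as [1, 0]) is printed as a literal.
def as_call(node)
  return node.inspect unless node.is_a?(Array) && node.first.is_a?(Symbol)

  name, *args = node
  "#{name}(#{args.map { |arg| as_call(arg) }.join(', ')})"
end

sexp = Ripper::SexpBuilder.new("1 + 1").parse
puts as_call(sexp)
# program(stmts_add(stmts_new(), binary(@int("1", [1, 0]), :+, @int("1", [1, 4]))))
```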
As you can see, even though the scanner fires events on whitespace, there aren’t any whitespace characters passed to any of the callbacks. I don’t know if there’s anything else happening to these, but of course you can define callbacks for the different kinds of whitespace and do something useful with them. The same is true for comments and quite a lot of other things that don’t make a semantic difference in Ruby (such as parentheses for method calls etc.).
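As an illustration of such a callback, here is a minimal sketch that keeps the comments together with their positions. CommentCollector is a made-up name, and the sketch assumes Ripper's lineno and column methods for the position.

```ruby
require 'ripper'

# Hypothetical example class: keeps the comments that never reach the
# parser events.
class CommentCollector < Ripper
  def comments
    @comments ||= []
  end

  def on_comment(token)
    # chomp because the comment token may include the trailing newline.
    comments << [token.chomp, [lineno, column]]
    token
  end
end

collector = CommentCollector.new("x = 1 # one\n# two\n")
collector.parse
p collector.comments
# => [["# one", [1, 6]], ["# two", [2, 0]]]
```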
To examine all events in the order they are actually fired you can use the event log that ships with Ripper2Ruby:
src = "1 + 1"
Ripper::EventLog.out(src)
will output:
@int 1
@sp " "
@op +
@sp " "
@int 1
binary
stmts_new
stmts_add
program
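Ripper::EventLog comes from Ripper2Ruby, not from Ripper itself. If you only have the standard library at hand, a rough stand-in can be sketched by defining a callback for every known event and recording the order in which they are called. EventRecorder is a made-up name, and its output format only approximates the log above.

```ruby
require 'ripper'

# Hypothetical example class: records every event in firing order.
class EventRecorder < Ripper
  attr_reader :events

  def initialize(src)
    super
    @events = []
  end

  # Scanner callbacks receive the token text.
  Ripper::SCANNER_EVENTS.each do |event|
    define_method(:"on_#{event}") do |token|
      @events << "@#{event} #{token.inspect}"
      token
    end
  end

  # Parser callbacks receive the results of the inner events.
  Ripper::PARSER_EVENTS.each do |event|
    define_method(:"on_#{event}") do |*args|
      @events << event.to_s
      args.first
    end
  end
end

recorder = EventRecorder.new("1 + 1")
recorder.parse
puts recorder.events
```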
I’m not an expert here but Ripper’s s-expressions and events seemed to make more sense to me than ParseTree’s stuff. Ripper still doesn’t seem to be completely consistent though.
E.g. for word lists (i.e. Arrays that are defined using %w() syntax) there are different events fired depending on whether you have %w() or %W().
src = '%W(foo bar)'
pp Ripper::SexpBuilder.new(src).parse
outputs:
[:program,
 [:stmts_add,
  [:stmts_new],
  [:words_add,
   [:words_add,
    [:words_new],
    [:word_add, [:word_new], [:@tstring_content, "foo", [1, 3]]]],
   [:word_add, [:word_new], [:@tstring_content, "bar", [1, 7]]]]]]
But on the other hand:
src = '%w(foo bar)'
pp Ripper::SexpBuilder.new(src).parse
outputs:
[:program,
 [:stmts_add,
  [:stmts_new],
  [:qwords_add,
   [:qwords_add, [:qwords_new], [:@tstring_content, "foo", [1, 3]]],
   [:@tstring_content, "bar", [1, 7]]]]]
As you can see, for qwords (i.e. the non-interpolating version) the counterparts of the :word_add and :word_new events seem to be missing: each :@tstring_content token is passed directly to :qwords_add. I can’t see any good reason for this.
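To compare the two variants without reading the trees by eye, you can collect the event names that occur in each s-expression. This is only a sketch: event_names is a helper made up for this example, and the exact lists depend on your Ruby version, so no output is shown.

```ruby
require 'ripper'

# Collect the name of every event that appears in an s-expression.
def event_names(node, names = [])
  return names unless node.is_a?(Array)

  names << node.first if node.first.is_a?(Symbol)
  node.each { |child| event_names(child, names) }
  names.uniq
end

p event_names(Ripper::SexpBuilder.new('%W(foo bar)').parse)
p event_names(Ripper::SexpBuilder.new('%w(foo bar)').parse)
```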
Also, Ripper seems to get the method call operator wrong when you use "::":
src = "A::b()"
pp Ripper::SexpBuilder.new(src).parse
outputs:
[:program,
 [:stmts_add,
  [:stmts_new],
  [:method_add_arg,
   [:call,
    [:var_ref, [:@const, "A", [1, 0]]],
    :".",
    [:@ident, "b", [1, 3]]],
   [:arg_paren, nil]]]]
Note the period, which should be a :"::" symbol.
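What Ripper reports for the operator may well differ between Ruby versions, so it is worth checking against the Ruby you actually run. A small sketch for pulling the :call node out of the tree (find_node is a helper made up for this example):

```ruby
require 'ripper'

# Depth-first search for the first node produced by the given event.
def find_node(node, event)
  return nil unless node.is_a?(Array)
  return node if node.first == event

  node.each do |child|
    found = find_node(child, event)
    return found if found
  end
  nil
end

call = find_node(Ripper::SexpBuilder.new("A::b()").parse, :call)
_event, receiver, operator, method_name = call
p operator # what this prints depends on the Ruby version in use
```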
In quite a few situations I’ve found the events ambiguous or not explicit enough. E.g. for the closing parenthesis in a word list like %w(foo bar) Ripper fires a :@tstring_end event, which is the same event it fires for the closing parenthesis of a String such as %(foobar).
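You can see this with Ripper.lex, which returns the scanner events for a piece of code without any parser events mixed in. A short sketch (last_event is a made-up helper):

```ruby
require 'ripper'

# Ripper.lex returns one entry per scanner event: position, event name,
# token text (newer Ruby versions append further fields).
def last_event(src)
  _position, event, token = Ripper.lex(src).last
  [event, token]
end

p last_event('%w(foo bar)') # => [:on_tstring_end, ")"]
p last_event('%(foobar)')   # => [:on_tstring_end, ")"]
```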
It gets really weird when you try to build something from the events that Ripper fires for Heredocs, or even stacked Heredocs combined with method calls on the Heredoc opener token (maybe the weirdest Ruby construct anyway). In general though this stuff is fun to work with and quite obvious once you’ve got the idea :)