parboiled2

a macro-based PEG-parsers
generator for Scala 2.10+

Alexander Myltsev
alexander-myltsev@github
linkedin.com/in/alexandermyltsev
alex_myltsev@twitter


This talk: http://myltsev.com/ScalaDays2014

my credentials

  • Google SoC'13 developed initial structure
  • porting to scala-js

parsing task

  • input: stream of characters or bytes
  • output: is it structured? what is the structure?


example:

scala?!> val input = "1 + 2 * 3"

scala?!> new Parser(input).run.arithmetical_expression?
true

scala?!> new Parser(input).run.AST_Structure
     ☂   +     ✂
     ♎  / \    ☎
Some ☭ 1   *   ☢
     ♎    / \  ❄
     ❤   2   3 ❤
                    

what motivated us

  • parboiled 1.x used by numerous projects
  • Scala macros introduced
  • save API and improve performance

existing projects

  • tools and libs dealing with a language atop of JVM
  • use Scala combinator parsers or parboiled 1.x? —
    consider using parboiled2

define requirements

  1. work inside of Scala ecosystem
    • tools: IDE, debugger, profiler, etc.
    • Java — not necessary
  2. work fast
  3. easy to maintain — typed embedded DSL
  4. few dependencies

existing alternatives

  • barehanded parser
  • ANTLR — ANother Tool for Language Recognition
  • parser combinators (Haskell Parsec, Scala parser combinators, etc.)


either different IDE, or slower, or harder to maintain

be based on RegEx?

can't parse recursive data: arithmetical expr, json, etc.


scala?!> new RegExParser({
                           "location": "Kosmos Belin",
                           "presentation": {
                             "name" : "parboiled2",
                             "geeky": true
                           }
                         }).run
                    

compare to parboiled 1.x

  • ~10 times slower
  • less powerful DSL
  • more dependencies

parboiled2

gentle introduction

core features

  • Parsing Expression Grammar
  • DSL: flexible, type-safe
  • compile-time optimization
  • parsing errors reporting
  • one parsing phase, no lexer required
  • small API

install it

  • just a Scala lib
  • build against 2.10.4 and 2.11.1

libraryDependencies+= "org.parboiled" %% "parboiled" % "2.0.0"

import org.parboiled2._
                    

toy parser


abstract class Parser {
  def input: ParserInput
}

class ToyParser(val input: ParserInput) extends Parser {
  def InputLine = rule { "a" ~ ("b" | "c") }
}
                    

toy parser


def InputLine = rule { "a" ~ ("b" | "c") }
                    

> new ToyParser("ab").InputLine.run()
Try[Unit] = Success(())
                    

> new ToyParser("ac").InputLine.run()
Try[Unit] = Success(())
                    

> new ToyParser("d").InputLine.run()
Try[Unit] = Failure(org.parboiled2.ParseError)
                    

> val p = new ToyParser("d")
p: ToyParser = ToyParser@12345

> val Failure(err: ParseError) = p.InputLine.run()
err: ParseError = org.parboiled2.ParseError

> p.formatError(err)
String =
Invalid input 'd', expected 'a' (line 1, column 1):
d
^
                    

toy parser: step by step


def InputLine = rule { "a" ~ ("b" | "c") }

input: "ac"
                    
  1. InputLine starts — calls "~" subrule — cursor at 0 pos
  2. "~" matches left "a", and if succeeds then right "|"
  3. rule "a" is matched against input(0) — cursor is at 1
  4. "|" matches left "b", and if fails then matches "c"
  5. "b" is failed to match — cursor stays at 1
  6. "c" is succeeded to match — cursor is moved to 2
  7. match is successful

how to produce AST?

value stack & typed rules

exploring value stack

  1. rules do not return any value
  2. push and pop result from ValueStack
    • implicit temporary storage
    • e.g. used for constructing AST, "in-phase" computations
  3. rule is typed according to its interaction with ValueStack

rule typing


class Rule[-I <: HList, +O <: HList]
                    
  • I   <: HList   number and types the rule pops off
  • O <: HList   number and types the rule pushes on
  • types are inferred (in most cases)
  • aliases

type RuleN[L <: HList]   = Rule[HNil, L]
                    

type Rule0               = RuleN[HNil]
                    

type Rule1[T]            = RuleN[T :: HNil]
                    

// B is pushed after A
type Rule2[A, B]         = RuleN[A :: B :: HNil]
                    

type PopRule[L <: HList] = Rule[L, HNil]
                    

rules groups

  • basic chars matchers
  • rules combinators and modifiers
  • parser actions

basic chars matchers

  • match input and cause progress
  • do nothing with ValueStack

implicit def ch(c: Char): Rule0
rule { 'a' } // rule { ch('a') }
                    

implicit def str(s: String): Rule0
rule { "ab" } // rule { str("ab") }

rule { 42 } // compile-time error
                    

implicit def predicate(p: CharPredicate): Rule0
CharPredicate.Digit, CharPredicate.LowerHexLetter, etc.
                    

def EOI: Char
                    

rules combinators & modifiers

e1 ~ e2

  • matches if e1 matches and then if e2 matches
  • type

def a: Rule1[Int]         = // ...
def b: Rule1[String]      = // ...
def c: Rule2[Int, String] = rule { a ~ b }
                    

Rule[     , A    ] ~ Rule[     , B  ] = Rule[     , A:B    ]

Rule[A:B:C, D:E:F] ~ Rule[F    , G:H] = Rule[A:B:C, D:E:G:H]

Rule[A    , B:C  ] ~ Rule[D:B:C, E:F] = Rule[D:A  , E:F    ]

Rule[A    , B:C  ] ~ Rule[D:C  , E:F] - is Illegal if D != B
                    

e1 | e2

  1. tries to match e1
  2. if matches then succeeds
  3. otherwise — result of e2 matching

optional(e)

  • tries to match e
  • always succeeds, even if e fails to match

optional(e: Rule0)    : Rule0
optional(e: Rule1[T]) : Rule1[Option[T]]
                    

zeroOrMore(e) and oneOrMore(e)

  • zeroOrMore runs e until it fails — always succeeds
  • oneOrMore runs e until it fails — succeeds if inner rule is succeeded at least once

zeroOrMore(e: Rule0)    : Rule0
zeroOrMore(e: Rule1[T]) : Rule1[Seq[T]]

oneOrMore(e: Rule0)     : Rule0
oneOrMore(e: Rule1[T])  : Rule1[Seq[T]]
                    

parser actions

  • can build recognizers so far: "does it have structure?"
  • need "actions" to produce result

push

pushes value onto ValueStack


rule { "true" ~ push(true) }
					

push(e: Unit)       : Rule0    - pushes nothing
					

push(e: L <: HList) : RuleN[L] - pushes all values of L
					

push(e: T)          : Rule1[T] - pushes a value of T
					

capture

pushes additional matched string onto ValueStack


rule { capture("42") } - if input matches "42" then pushes "42"
					

capture(zeroOrMore(CharPredicate.Digit)) : Rule1[String]
					

capture("true" ~ push(true))             : Rule2[Boolean, String]
					

a ~> (func)

transforms top elements of the Value Stack
into some other object(s)


(foo: Rule1[String])      ~> ((s: String) ⇒ s.toInt)  : Rule1[Int]
					

(foo: Rule1[String])      ~> (s ⇒ s.toInt)            : Rule1[Int]
					

(foo: Rule1[String])      ~> (_.toInt)                : Rule1[Int]
					

(foo: Rule2[Int, String]) ~> ((i, s) ⇒ i + s.toInt)   : Rule1[Int]
					

(foo: Rule1[String])      ~> (println(_))             : Rule0
					

(foo: Rule1[String])      ~> (() ⇒ 17)                : Rule2[String, Int]
(foo: Rule1[String])      ~  push(17)                 : Rule2[String, Int]
					

(foo: Rule1[String]) ~> (s ⇒ s.toInt :: 3.14 :: HNil) : Rule2[Int, Double]
					

case class Person(name: String, age: Int)
(foo: Rule2[String, Int]) ~> Person                   : Rule1[Person]
					

reduction rules

parser makes progress leaving value stack unchanged


reduce "1+2+3+4+5" to (15:Int)
                    

(number: Rule1[Int]) ~ zeroOrMore("+" ~ number) : Rule2[Int, Seq[Int]]
                    

def reductor = "+" ~ number ~>
((a:Int, b: Int) => a + b) : Rule[Int::HNil, Int::HNil]

(number: Rule1[Int]) ~ zeroOrMore(reductor) : Rule1[Int]
                    

capture(CharPredicate.Digit) ~ optional("h" ~> ((s: String) ⇒ s + "hex"))

(factor: Rule1[Int]) ~ oneOrMore("*" ~ factor ~> ((a: Int, b) ⇒ a * b))
                    

calculator


> new Calculator("1+(2+4)*3").InputLine.run() 
scala.util.Success(25)
                    
class Calculator(val input: ParserInput) extends Parser {
  def Digits = rule { oneOrMore(CharPredicate.Digit) } // Rule1[Stirng]

  def Number = rule { capture(Digits) ~> (_.toInt) }   // Rule1[Int]
  def Parens = rule { '(' ~ Expression ~ ')' }
  def Factor = rule { Number | Parens }
  def Term = rule { Factor ~ zeroOrMore('*' ~ Factor ~> ((_: Int) * _)
                                      | '/' ~ Factor ~> ((_: Int) / _)) }
  def Expression: Rule1[Int] = rule {
      Term ~ zeroOrMore('+' ~ Term ~> ((_: Int) + _)
                      | '-' ~ Term ~> ((_: Int) - _)) }
  def InputLine = rule { Expression ~ EOI }

}

under the hood

rule macro


case class Rule(matched: Boolean)
                    

abstract class Parser {
  def rule(r: Rule): Rule = macro ParserMacros.ruleImpl

  def input: ParserInput
  def __nextChar(): Char = // ...
  def __cursor: Int = // ...
}
                    

object ParserMacros {	
  def ruleImpl(ctx: ParserContext)(r: ctx.Expr[Rule]): ctx.Expr[Rule] = {
    // ...
  }
} 
  					

two step render()


def ruleImpl(ctx: ParserContext)(r: ctx.Expr[Rule]): ctx.Expr[Rule] = 
  reify {
    OpTree(r.tree).render().splice
  }
                    
  1. input: Scala compiler AST
  2. IR: parboiled2 OpTree
  3. output: effective Scala code AST

quasiquotes match


def r = rule { "a" }
def r = rule { str("a") }
def r = rule { CurrentParser.this.str("a") }
  					

r.tree = Apply(Select(This(TypeName("CurrentParser")), TermName("str")),
               List(Literal(Constant("a"))))
  					

r.tree match {
  case Apply(Select(_, TermName("str")), List(Literal(Constant(s)))) ⇒
    StringMatch(s) // s is "a"
}
  					

r.tree match {
  case q"$a.this.str($s)" ⇒ StringMatch(s) // a is CurrentParser
                                           // s is "a"
}
  					

OpTree


object OpTree {
  def apply(tree: Tree): OpTree =
    r.tree match {
      case q"$a.this.str($s)" ⇒ StringMatch(s)
      case q"$a.this.ch($ch)" ⇒ CharMatch(ch)
      case q"$lhs.|($rhs)"    ⇒ FirstOf(OpTree(lhs), OpTree(rhs))
      case q"$lhs.~($rhs)"    ⇒ Sequence(OpTree(lhs), OpTree(rhs))
    }
}
                    

abstract class OpTree {
  def render(): Expr[Rule]
}

case class StringMatch(strTree: Tree)         extends OpTree
case class CharMatch(charTree: Tree)          extends OpTree
case class FirstOf(lhs: OpTree, rhs: OpTree)  extends OpTree
case class Sequence(lhs: OpTree, rhs: OpTree) extends OpTree
                    

CharMatch


// (rule { ch('a') }).tree match { case q"$a.this.ch($charTree)" ⇒ ... }

case class CharMatch(charTree: Tree) extends OpTree {
  def render(): Expr[Rule] = reify {
    val p    = c.prefix.splice               // Current parser
    val char = c.Expr[Char](charTree).splice // : Char
    Rule(p.__nextChar() == char)
  }
}
                    

StringMatch: naive


// (rule { str('a') }).tree match { case q"$a.this.str($strTree)" ⇒ ... }

case class StringMatch(strTree: Tree) extends OpTree {
  def render(): Expr[Rule] = c.Expr[Rule](q''
    val p   = ${c.prefix} // Current parser
    val str = $strTree    // : String

    val res = str.forall(c ⇒ c == p.__nextChar())
    Rule(res)
'')
					

StringMatch: effective


// (rule { str('a') }).tree match { case q"$a.this.str($strTree)" ⇒ ... }

case class StringMatch(strTree: Tree) extends OpTree {
  def render(): Expr[Rule] = c.Expr[Rule](q''
    val p   = ${c.prefix} // Current parser
    val str = $strTree    // : String

    var ix = 0
    while (ix < str.length && str.charAt(ix) == p.__nextChar()) {
      ix += 1
    }
    Rule(ix == str.length)
'')
					

Sequence: a ~ b


  // case q"$lhs.~($rhs)" ⇒ Sequence(OpTree(lhs), OpTree(rhs))

  case class Sequence(lhs: OpTree, rhs: OpTree) extends OpTree {
    def render(): Expr[Rule] = c.Expr[Rule](q''
    val res = lhs.render().matched && rhs.render().matched
    Rule(res)
  '')
}
                    

FirstOf: a | b


// case q"$lhs.|($rhs)" ⇒ FirstOf(OpTree(lhs), OpTree(rhs))

case class FirstOf(lhs: OpTree, rhs: OpTree) extends OpTree {
  def render(): Expr[Rule] = c.Expr[Rule](q''
    val mark = p.__cursor
    if (lhs.render().matched) {
      Rule(true)
    } else {
      p.__cursor = mark
      val res = rhs.render().matched
      Rule(res)
    }
  '')
}
					

benchmark


> cappi::benchmark

            benchmark    ms linear runtime
 Parboiled1JsonParser 85.64 ==============================
  arboiled2JsonParser 13.17 ====
             Argonaut  7.01 ==
         Json4SNative  8.06 ==
        Json4SJackson  4.09 =
                	

Thanks

Q & A

@Alex_Myltsev

parboiled2.org