用Parsley写富状态的解析器

在这篇博客里我会以一个玩具解析器解释怎么在Parsley里写富状态解析器。

初见富状态解析器

Parsley允许你在解析引用／设置一个状态，这个有点像Haskell的State monad。

scala

import parsley.state.{RefMaker, Ref}

case class State(x: Int)
val p = State(0).makeRef { (state: Ref[State]) =>
    // 在这个块里我们可以与此状态交互
    // 这个块得返回一个解析器

    // 对于一个解析器，我们可以用flatMap / >>=取得解析结果并在副作用中使用它
    string("foo").flatMap { (str: String) =>
        state.update { (s: State) =>
            // 在这个块里我们可以直接拿到状态里的值
            State(s.x + 1)
        }
    }
}

以上代码让你很快可以糊出你想要的效果，但它的运行效率不行，因为Parsley的内部架构不能优化monadic的解析器。正确的做法（“applicative解析器”）没这么直白，但是也不难，上例子：

栗子

假设我们的输入长这样：

boo()
bar(1)
    baz(1, 2)
        fizz()
        buzz()
    h(1)
g(2)
n(3)

每一行有一个语句（简化为many(letter)）然后后面有个括号，括号里可能会引用之前的行数。这个语法很明显是状态敏感的（引用之前的解析结果，还用缩进表示语句块）。

事已至此，先写点类型

先看看解析结果的形状是啥：

scala

sealed trait Pf
case class Stmt(body: String, ref: List[String]) extends Pf
case class Scope(var body: List[Pf]) extends Pf

接下来我们得想想这个解析器的状态是什么：

当前的缩进数
之前的所有解析结果

这里偷懒直接用了一个可变的栈，不过肯定有办法让他更纯点的。

遂决定状态的类型如下：

scala

case class State(
    var level: Int,
    lines: List[String],
    var scope: List[Scope]
)

已经很命令式了，所以就全可变吧（无慈悲

开写：

scala

import parsley.Parsley
import parsley.Parsley.{many, atomic, pure}
import parsley.character.{char, item, stringOfMany, letter, digit}
import parsley.combinator.sepBy
import parsley.state.{RefMaker, Ref, StateCombinators}
import parsley.syntax.character.{charLift, stringLift}

// 后面会用，简单起见就不用lexer模块里的了
val number = digit.foldLeft1[Int](0)((n, d) => n * 10 + d.asDigit)

解析语句

先来一个富状态的解析器：

scala

val p: Parsley[Scope] = State(0, List(), List(Scope(List()))).makeRef { (state: Ref[State]) =>
    ???
}

然后把状态加进去，我先用flatMap写一下大概逻辑，后面再把它换成applicative的。

scala

val p: Parsley[Scope] = State(0, List(), List(Scope(List()))).makeRef { (state: Ref[State]) =>
    val stmt: Parsley[Stmt] = (
        many(letter).map(_.mkString).flatMap { str =>
            state.update { s => s.copy(lines = s.lines :+ str) }
        } <~ '('
        <~>
        sepBy(number, ", ").flatMap { xs => state.gets { s =>
                xs.map { (x: Int) => s.lines(x - 1) }
            }
        } <~ ')'
    ).map(???)
}

上面这段有点问题嗷，这里会报类型错误。在flatMap里你应该返回一个List[Stmt]，但是state.gets会返回Parsley[List[Stmt]]，但不用这个方法我们又拿不到State的值。可能拿flatMap也能写，但是这里用applicative就能把类型对上。

scala

p.flatMap { res => state.update(res) }
// 等于
state.update { p }

不过state.update返回的是Parsley[Unit]，我们得再get把它拿出来。

scala

// 书接上回
val stmt: Parsley[Stmt] = (
    state.update(
        many(letter).map(_.mkString).map { str => (s: State) =>
                s.copy(lines = s.lines :+ str)
        }
    ) ~> '(' ~> state.get.map(_.lines.last) // 或者直接gets
    <~> ??? <~ ')' // 别忘了括号里还有东西
).map { res => Stmt(res._1, res._2) }

在重新写之前写不出来的逻辑前我们先来看看gets的类型。

scala

def gets[B](pf: Parsley[State => B]): Parsley[B]

有点摸不着头脑吧？Parsley[A]的意思是从字符串解析个A出来，但解析个State => B出来是啥意思？

先想下最终结果是啥，是List[Stmt]，那这就是我们这里的B了。

scala

def gets[List[Stmt]](pf: Parsley[State => List[Stmt]]): Parsley[List[Stmt]]

先不说到底是啥意思，如果我们map一下，就能搞到一个类型是Parsley[State => List[Stmt]] 的东西。

scala

// ... <~>
state.gets {
    sepBy(number, ", ").map { ??? }
} <~ ')'

那就只差拿State里面的东西了。。。

scala

sepBy(number, ", ").map { xs => (s: State) =>
    xs.map { (x: Int) => s.lines(x - 1) }
}

然后突然发现返回值的类型就是我们想要的State => List[Stmt] 😄

现在为止：

scala

val stmt: Parsley[Stmt] = (
    state.update(
        many(letter).map(_.mkString).map { str => (s: State) =>
                s.copy(lines = s.lines :+ str)
        }
    ) ~> '(' ~> state.gets(_.lines.last)
    <~> state.gets {
        sepBy(number, ", ").map { xs => (s: State) =>
            xs.map { (x: Int) => s.lines(x - 1) }
        }
    } <~ ')'
).map { res => Stmt(res._1, res._2) }

添加缩进逻辑

现在得解析代码块了

scala

val scope: Parsley[Scope] = many(
    state.update((
        many(' ').map(_.length) <~> stmt <~ '\n'
    ).map { res => (s: State) =>
        ???
    }
)) ~> state.get.map(???)

最终结果是很多行缩进+语句，和stmt里的逻辑是差不太多的，所以应该不太难。

假设我们知道上一行的缩进为i，本行合法的缩进有下面几种情况：

还是 i，没有新建块，直接往现在的块加

scala

    if i == s.level then
        sc.head.body = sc.head.body :+ sm

i + 4，新建一个带有本行的语句块，然后把它挂在之前的块底下

scala

    else if i == s.level + 4 then
        val newScope = Scope(List(sm))
        sc.head.body = sc.head.body :+ newScope
        s.level += 4
        s.scope = newScope :: s.scope

i - 4，把现在的块给pop掉，然后往上一层的块里加

scala

    else if i == s.level - 4 then
        // prev scope
        val t = sc.tail
        t.head.body = t.head.body :+ sm
        s.level -= 4
        s.scope = t

好，那到最后一行的时候呢？

如果之前做的都是对的，栈最底下的应该就是根语句块。

scala

state.get.map { s => s.scope.last }

放到一起：

scala

val scope: Parsley[Scope] = many(
    state.update((
        many(' ').map(_.length) <~> stmt <~ '\n'
    ).map { res => (s: State) =>
        val sc = s.scope
        val (i, sm) = res
        if i == s.level then
            sc.head.body = sc.head.body :+ sm
        else if i == s.level + 4 then
            // new scope
            val newScope = Scope(List(sm))
            sc.head.body = sc.head.body :+ newScope
            s.level += 4
            s.scope = newScope :: s.scope
        else if i == s.level - 4 then
            // prev scope
            val t = sc.tail
            t.head.body = t.head.body :+ sm
            s.level -= 4
            s.scope = t
        else
            fail()  // TODO: better errors
        s
    }
)) ~> state.get.map { s => s.scope.last }

全部代码：

全部代码

scala

val p: Parsley[Scope] = State(0, List(), List(Scope(List()))).makeRef { (state: Ref[State]) =>
    val stmt: Parsley[Stmt] = (
        state.update(
            many(letter).map(_.mkString).map { str => (s: State) =>
                    s.copy(lines = s.lines :+ str)
            }
        ) ~> '(' ~> state.gets(_.lines.last)
        <~> state.gets {
            sepBy(number, ", ").map { xs => (s: State) =>
                xs.map { (x: Int) => s.lines(x - 1) }
            }
        } <~ ')'
    ).map { res => Stmt(res._1, res._2) }
    val scope: Parsley[Scope] = many(
        state.update((
            many(' ').map(_.length) <~> stmt <~ '\n'
        ).map { res => (s: State) =>
            val sc = s.scope
            val (i, sm) = res
            if i == s.level then
                sc.head.body = sc.head.body :+ sm
            else if i == s.level + 4 then
                // new scope
                val newScope = Scope(List(sm))
                sc.head.body = sc.head.body :+ newScope
                s.level += 4
                s.scope = newScope :: s.scope
            else if i == s.level - 4 then
                // prev scope
                val t = sc.tail
                t.head.body = t.head.body :+ sm
                s.level -= 4
                s.scope = t
            else
                fail()  // TODO: better errors
            s
        }
    )) ~> state.get.map { s => s.scope.last }
    scope
}

输入示例

scala

val testStr = """
boo()
bar(1)
    baz(1, 2)
        fizz()
        buzz()
    h(1)
g(2)
n(3)
            """.trim() + "\n"
        val output = """
boo()
bar(boo)
    baz(boo, bar)
        fizz()
        buzz()
    h(boo)
g(bar)
n(baz)
            """.trim()
// toString implementation omitted
assert(p.parse(testStr).get.toString() === output)

感想

我刚写完这个的时候脑子里有个问题，如果我写命令式那一个for循环就秒了，那用解析器组合子有啥好处呢？原作者说：

Quote

For most languages, the grammar is constructed in such a way that it remains context-free. This is, primarily, because context-sensitive grammars are a brutal combination of hard to express and hard to parse efficiently. Indentation-sensitive parsing can be considered an example of a context-sensitive grammar, though, in practice, some compilers like to shunt the work out to the lexer to make the grammar context-free again (this is the case with Python).

~~顺便提一嘴教程里富状态解析器的教程到现在都还没写~~

但是我还是觉得用解析器组合子是有优点的。

自动生成报错信息
更函数式，更容易维护的代码
解析器组合子又不是不能干

希望这篇博客有帮到你（特别是缩进那块的逻辑可以复制粘贴的）。这代码质量说不上太好，有点命令式，不过有些可以藏到状态里面。作为解析器组合子新手我能写出这么些东西我还是挺开心的。

解析顺利！

用Parsley写富状态的解析器 ​

初见富状态解析器 ​

栗子 ​

事已至此，先写点类型 ​

解析语句 ​

添加缩进逻辑 ​

感想 ​

用Parsley写富状态的解析器

初见富状态解析器

栗子

事已至此，先写点类型

解析语句

添加缩进逻辑

感想