Writing a Markdown Lexer and Parser in Zig

Published: at 11:48 AM

Introduction

Markdown is a lightweight markup language with plain text formatting syntax. Writing a parser for it is a fascinating challenge because the syntax is context-sensitive (e.g., indentation matters, newlines behave differently in different blocks).

In this post, we’ll explore how to build a Markdown lexer and parser in Zig. We will leverage Zig’s powerful type system, specifically Tagged Unions (Sum Types) and Exhaustive Switch statements, to implement a robust Deterministic Finite Automaton (DFA).

Design Philosophy: Simple Lexer, Complex Parser

To understand why we made these choices, let’s look at the actual data flow in mdz. The architecture is built around a unidirectional pipeline where the Lexer feeds tokens into a stateful Parser.

Data Flow

+--------------+       +--------------+       +--------------+
|  Source Code | ----> |    Lexer     | ----> |    Parser    | ----> AST (Blocks)
+--------------+       +--------------+       +--------------+
                              |                      ^
                              |                      |
                        [Context-Free]         [Context-Aware]
                        - Tab                  - State: Empty
                        - Space                - State: MaybeTitle
                        - Sign (#, *, etc)     - State: CodeBlock
                        - Str                  - ...

The Lexer is intentionally designed to be “dumb”. It doesn’t know whether a # starts a heading or is just literal text inside a code block; it only sees a Sign. This simplicity keeps the Lexer extremely fast, and even leaves room for SIMD instructions to scan bytes in parallel.
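The Lexer’s context-free view can be sketched in a few lines of Zig. Everything below (ByteClass, classify) is illustrative naming, not mdz’s actual API, and assumes a recent Zig toolchain:

```zig
const std = @import("std");

// Hypothetical classifier mirroring the Lexer's context-free view: it maps
// a single byte to a category with zero knowledge of Markdown structure.
const ByteClass = enum { tab, space, line_end, sign, digit, other };

fn classify(byte: u8) ByteClass {
    return switch (byte) {
        '\t' => .tab,
        ' ' => .space,
        '\n' => .line_end,
        '0'...'9' => .digit,
        // ASCII punctuation ranges that Markdown treats as signs (#, *, etc.)
        0x21...0x2F, 0x3A...0x40, 0x5B...0x60, 0x7B...0x7E => .sign,
        else => .other,
    };
}

test "classification ignores context" {
    // '#' is a sign whether it begins a heading or sits inside a code block.
    try std.testing.expectEqual(ByteClass.sign, classify('#'));
    try std.testing.expectEqual(ByteClass.digit, classify('7'));
}
```

Because each byte’s class depends only on its own value, a production lexer can classify many bytes at once with SIMD compares.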

The Parser, on the other hand, holds all the context. It maintains a State that evolves as it consumes tokens. This is where the complexity lives, as Markdown is context-sensitive.

The Parser Loop

Here is the core loop from src/parser.zig. Notice how it takes a token from the lexer, passes it to the DFA transition function (DFA.f), and checks if the state has reached a Done condition to emit a Block (like a Paragraph or Title).

pub fn next(self: *Self, lexer: *Lexer) ParseError!?Block {
    // 1. Initialize or recover state
    var state = if (self.recover_state) |state|
        State{ .state = state, .allocator = self.allocator }
    else
        State.empty(self.allocator);

    // 2. Loop through tokens
    while (lexer.next()) |value| {
        switch (value.item) {
            .ok => |token| {
                // 3. Apply Transition Function: f(State, Token) -> NewState
                try DFA.f(&state, token, value.span);
                
                // 4. Check if a Block is complete
                switch (state.state) {
                    .Done => {
                        self.recover_state = state.recover_state;
                        return state.value;
                    },
                    else => {
                        continue;
                    },
                }
            },
            // ... error handling ...
        }
    }
    return null;
}

By decoupling the token generation from the semantic interpretation, we make the system easier to test and optimize. The Lexer is just a fancy iterator, while the Parser contains the pure state machine logic.

Pull vs Push Architecture

The architecture described above is a Pull-based system. The Parser controls the flow by calling lexer.next() whenever it is ready to process a new token.

In a Push-based system, the Lexer would drive the process, reading the file continuously and “pushing” tokens into the Parser (e.g., via callbacks or an event queue).

We chose the Pull model because:

  1. Parser-Centric Control: The Parser is the complex part. Giving it control means it can easily pause, look ahead, or stop after parsing a single block (as seen in Parser.next returning ?Block).
  2. Zero Overhead: Zig excels at simple, synchronous control flow. A Pull lexer is just a function call. A Push lexer often implies inversion of control (callbacks) or concurrency overhead (channels), which adds unnecessary complexity for a single-threaded parser.
  3. State Management: In Markdown, the boundary between blocks isn’t always clear until you parse the content. The Pull model allows the Parser to maintain its state stack and effectively “pull” the structure out of the flat stream of tokens.
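The difference is easiest to see from the consumer’s side. Here is a minimal Pull-style sketch (the StubLexer below is an invented stand-in, not mdz code): the consumer calls next() only when it wants another item, so pausing or stopping early costs nothing:

```zig
const std = @import("std");

// A stub "lexer" that its consumer drains one item at a time. In mdz the
// Parser plays the consumer role, calling lexer.next() only when it is
// ready for another token.
const StubLexer = struct {
    tokens: []const u8,
    index: usize = 0,

    fn next(self: *StubLexer) ?u8 {
        if (self.index >= self.tokens.len) return null;
        // defer runs after the return value is computed, so this yields
        // the current token and then advances.
        defer self.index += 1;
        return self.tokens[self.index];
    }
};

test "the consumer controls the pace" {
    var lexer = StubLexer{ .tokens = "#ab" };
    // Stopping after one token is trivial in a Pull design, and awkward
    // in a Push design built on callbacks.
    try std.testing.expectEqual(@as(?u8, '#'), lexer.next());
}
```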

AST Representation: BitFlags vs. Trees

Most parsers represent Markdown as a tree of nodes (e.g., a Strong node containing a Text node). While expressive, this involves significant pointer chasing and memory allocation.

mdz takes a different approach for inline formatting. Instead of a tree, we use BitFlags.

We define a set of decorations:

pub const TextKind = enum(u8) {
    Bold,
    Italic,
    Strikethrough,
    LaTex,
    Code,
};

pub const Decorations = std.bit_set.StaticBitSet(5);

pub const Text = struct {
    decorations: Decorations, // e.g., { Bold, Italic }
    span: Span,
};

When the parser encounters **bold *italic***, it doesn’t build a nested structure. Instead, it flattens the text into spans with attributes:

  1. "bold " -> [Bold]
  2. "italic" -> [Bold, Italic]

Trade-offs:

This representation drastically reduces the memory footprint and simplifies the rendering logic (just check the bits and emit tags). The cost is expressiveness: constructs that genuinely nest, such as a link whose text carries its own formatting, cannot be captured by a flat bit set and need separate handling.
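As a sketch of why rendering stays simple, emitting opening tags can be a handful of bit tests. renderOpenTags below is our illustration, not mdz’s renderer, and assumes the Zig 0.13-era std.io writer API:

```zig
const std = @import("std");

const TextKind = enum(u8) { Bold, Italic, Strikethrough, LaTex, Code };
const Decorations = std.bit_set.StaticBitSet(5);

// Rendering a flattened span is a linear pass over bits -- no tree
// traversal, no recursion.
fn renderOpenTags(dec: Decorations, writer: anytype) !void {
    if (dec.isSet(@intFromEnum(TextKind.Bold))) try writer.writeAll("<strong>");
    if (dec.isSet(@intFromEnum(TextKind.Italic))) try writer.writeAll("<em>");
}

test "a bold+italic span opens both tags" {
    var dec = Decorations.initEmpty();
    dec.set(@intFromEnum(TextKind.Bold));
    dec.set(@intFromEnum(TextKind.Italic));

    var buf: [32]u8 = undefined;
    var stream = std.io.fixedBufferStream(&buf);
    try renderOpenTags(dec, stream.writer());
    try std.testing.expectEqualStrings("<strong><em>", stream.getWritten());
}
```

Closing tags can be emitted the same way, by comparing one span’s bits against the next span’s.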

The Lexer: From Bytes to Tokens

The first step is to tokenize the input. Instead of working with raw characters in the parser, we categorize them into tokens. This simplifies the parser’s job.

In src/lexer.zig, we define the tokens. Zig’s union(enum) is perfect for this, as it allows tokens to carry data (payloads) only when necessary.

pub const TokenItemTag = enum {
    Tab,
    Space,
    LineEnd,
    Sign,
    AsciiNumber,
    Str,
    EOF,
};

pub const TokenItem = union(TokenItemTag) {
    const Self = @This();

    Tab: void,
    Space: void,
    LineEnd: void,
    Sign: u8,
    // Using slice to represent number content
    AsciiNumber: []const u8,
    Str: []const u8,
    EOF: void,

    // ... constructors (e.g., TokenItem.sign('#'))
};

The Lexer struct holds the state of the text traversal. We use std.unicode.Utf8Iterator to handle UTF-8 decoding seamlessly.

pub const Lexer = struct {
    const Self = @This();

    buffer: []const u8,
    index: usize = 0,
    utf8_iterator: UTF8Iterator,

    pub fn next(self: *Self) ?Token {
        // ... check buffer bounds ...
        if (self.next_code()) |code|
            switch (code) {
                0x9 => self.tab(),
                '\n' => self.lineEnd(false),
                '0'...'9' => {
                    // Logic to consume full number
                },
                // Handle markdown special characters
                0x21...0x2F, 0x3A...0x40, 0x5B...0x60, 0x7B...0x7E => |c| self.sign(@intCast(c)),
                else => {
                    // Consume normal string
                },
            }
        else
            null;
        // ... update index ...
    }
};

This separation allows the parser to focus on structure rather than byte-level decoding.
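The iterator underneath is easy to exercise in isolation. This standalone test (ours, against the std.unicode API of a recent Zig) shows that it yields whole codepoints rather than raw bytes, so multi-byte characters never split across tokens:

```zig
const std = @import("std");

test "Utf8Iterator yields codepoints, not bytes" {
    // "é" is two bytes in UTF-8 but comes back as a single codepoint.
    const view = try std.unicode.Utf8View.init("# é");
    var it = view.iterator();
    try std.testing.expectEqual(@as(?u21, '#'), it.nextCodepoint());
    try std.testing.expectEqual(@as(?u21, ' '), it.nextCodepoint());
    try std.testing.expectEqual(@as(?u21, 'é'), it.nextCodepoint());
    try std.testing.expectEqual(@as(?u21, null), it.nextCodepoint());
}
```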

The Parser: A State Machine (DFA)

The core logic of our parser lies in the State Machine. Markdown parsing depends heavily on “what we have seen so far”. For example, a # might be a header, or it might be just text inside a code block.

We define a StateKind enum and a corresponding StateItem union to hold context-specific data.

pub const StateKind = enum {
    Empty,
    MaybeTitle,
    MaybeTitleContent,
    MaybeCodeBlock,
    MaybeThematicBreak,
    NormalText,
    Done,
    // ... and many more
};

pub const StateItem = union(StateKind) {
    Empty: void,
    // For titles, we need to know the level (number of #)
    MaybeTitle: usize, 
    MaybeTitleContent: usize,
    
    // For text, we track the span
    NormalText: Span,
    
    Done: void,

    // ... helpers
};

The Transition Function

The heart of a DFA is the transition function: (CurrentState, Input) -> NextState. In our implementation (src/dfa/lib.zig), this is the f function.

pub fn f(s: *State, token: Token, span: Span) ParseError!ReturnType {
    try switch (token) {
        .Sign => |sign| switch (sign) {
            '`' => Sign.code(s, span),
            '#' => Sign.title(s, span),
            '*' => Sign.boldItalic(s, span),
            // ... dispatch to specific handlers
            else => Sign.normal(s, span),
        },
        .Tab => Ends.tab(s, span),
        .Space => Ends.space(s, span),
        .LineEnd => Ends.lineEnd(s, span),
        .Str => |str| unicodes(s, str, span),
        // ...
    };
}

Zig’s switch statements ensure that we handle every token type. If we add a new token type, the compiler will force us to update this logic.
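That guarantee is worth seeing concretely. In this toy example (not mdz code), the switch over a union(enum) has no else arm, so adding a fourth variant refuses to compile until describe handles it:

```zig
const std = @import("std");

// A miniature token type. Variants without a type carry no payload.
const Token = union(enum) {
    space,
    sign: u8,
    str: []const u8,
};

fn describe(token: Token) []const u8 {
    // No `else` arm: the compiler verifies every variant is covered.
    return switch (token) {
        .space => "space",
        .sign => "sign",
        .str => "str",
    };
}

test "every variant has an arm" {
    try std.testing.expectEqualStrings("sign", describe(.{ .sign = '#' }));
    try std.testing.expectEqualStrings("space", describe(.space));
}
```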

Example: Parsing Titles

Let’s look at how we handle titles (# Title) in src/dfa/sign/title.zig.

When the parser encounters a # token, it calls Sign.title. Inside, we check the current state to decide what to do.

pub fn title(state: *State, span: Span) Error!void {
    switch (state.state) {
        // If we are at the start of a block
        .Empty => {
            // Transition to MaybeTitle state with count 1
            state.state = .{ .MaybeTitle = 1 };
        },
        // If we are already parsing a title prefix (e.g., "##")
        .MaybeTitle => |*size| {
            // Increment the level
            size.* += 1;
        },
        // If we are inside a code block
        .MaybeFencedCodeBegin => |s| {
            // Treat # as normal text/code
            try state.value.?.Paragraph.addCode(state.allocator, ...);
            state.toNormalText(span);
        },
        // Default fallback: treat as normal text
        .NormalText => |*s| {
            _ = s.enlarge(span.len);
        },
        // ...
    }
}

This pattern matches the formal DFA definition: given the current state (e.g., MaybeTitle) and the input symbol (implicit in the fact that title was called), we either transition to a new state or update the current one (accumulating #s).

Step-by-Step Trace: # Hello

To see it in action, let’s trace the full parsing lifecycle of the string # Hello.

  1. Start:

    • State: Empty
    • Value: null
  2. Lexer emits Sign('#'):

    • Action: Sign.title(Empty)
    • Transition: Empty -> MaybeTitle(1)
    • Meaning: We found one hash, it might be a title.
  3. Lexer emits Space:

    • Action: Ends.space(MaybeTitle(1))
    • Transition: MaybeTitle(1) -> MaybeTitleContent(1)
    • Meaning: The hashes ended with a space, so it is a valid H1 title.
  4. Lexer emits Str("Hello"):

    • Action: unicodes(MaybeTitleContent(1))
    • Effect: state.value is initialized as Block.Title(level=1).
    • Transition: MaybeTitleContent -> NormalText
    • Meaning: We are now parsing the text content of the title.
  5. Lexer emits LineEnd:

    • Action: Ends.lineEnd(NormalText)
    • Transition: NormalText -> Done
    • Effect: The parser detects the block is complete and returns the Block.

Edge Case Trace: #Invalid

What if the user types #Invalid (no space)?

  1. Start: Empty.
  2. Lexer emits Sign('#'):
    • Action: Sign.title(Empty)
    • Transition: Empty -> MaybeTitle(1)
  3. Lexer emits Str("Invalid"):
    • Action: unicodes(MaybeTitle(1))
    • Logic: The parser sees text immediately after # without a space. This violates the Markdown spec for headers.
    • Transition: MaybeTitle(1) -> NormalText
    • Recovery: The parser effectively “rewinds” by creating a NormalText span that includes both the previous # and the current “Invalid” string. It now treats it as a Paragraph starting with #.

This state recovery is crucial. We didn’t need to backtrack the Lexer; the Parser simply reinterpreted the meaning of the previous token based on the current one.
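The recovery mechanics can be sketched with a simplified Span (a stand-in for mdz’s real type, whose fields we are guessing at): reinterpretation is just widening a span over input the lexer has already emitted, no rewinding required:

```zig
const std = @import("std");

// Simplified span: a byte range into the source buffer.
const Span = struct {
    begin: usize,
    len: usize,

    fn enlarge(self: *Span, extra: usize) void {
        self.len += extra;
    }
};

test "reinterpreting '#' by widening a span" {
    // Source "#Invalid": '#' occupies byte 0, "Invalid" occupies bytes 1..8.
    // The parser kept the '#' span while in MaybeTitle; on seeing text with
    // no space, it reuses that span as the start of a NormalText run.
    var text = Span{ .begin = 0, .len = 1 }; // starts at the remembered '#'
    text.enlarge(7); // absorb "Invalid"
    try std.testing.expectEqual(@as(usize, 8), text.len);
}
```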

Why Zig?

  1. Tagged Unions: As seen in TokenItem and StateItem, Zig allows us to associate data with enum variants safely. This eliminates the need for void* casting or complex inheritance hierarchies often found in C++ parsers.
  2. Control Flow: Zig’s explicit control flow and lack of hidden allocations make it easy to reason about performance.
  3. Error Handling: The try keyword and error sets make handling unexpected tokens or structural errors consistent.

Conclusion

Writing a parser in Zig is a rewarding experience. The language features align well with the theoretical concepts of automata theory. By explicitly defining states and transitions, we create a parser that is both easy to extend and performant.

The mdz project demonstrates that you don’t always need parser generators (like Yacc/Bison) or complex parser combinators. A hand-written DFA in Zig is readable, type-safe, and fast.