Writing a Markdown Lexer and Parser in Zig
Introduction
Markdown is a lightweight markup language with plain text formatting syntax. Writing a parser for it is a fascinating challenge because the syntax is context-sensitive (e.g., indentation matters, newlines behave differently in different blocks).
In this post, we’ll explore how to build a Markdown lexer and parser in Zig. We will leverage Zig’s powerful type system, specifically Tagged Unions (Sum Types) and Exhaustive Switch statements, to implement a robust Deterministic Finite Automaton (DFA).
Design Philosophy: Simple Lexer, Complex Parser
To understand why we made these choices, let’s look at the actual data flow in mdz. The architecture is built around a unidirectional pipeline where the Lexer feeds tokens into a stateful Parser.
Data Flow
+--------------+       +--------------+       +--------------+
| Source Code  | ----> |    Lexer     | ----> |    Parser    | ----> AST (Blocks)
+--------------+       +--------------+       +--------------+
                              |                      ^
                              |                      |
                       [Context-Free]         [Context-Aware]
                       - Tab                  - State: Empty
                       - Space                - State: MaybeTitle
                       - Sign (#, *, etc.)    - State: CodeBlock
                       - Str                  - ...
The Lexer is intentionally designed to be “dumb”. It doesn’t know if a # is a header or a comment; it just sees a Sign. This simplicity allows the Lexer to be extremely fast, potentially using SIMD instructions to scan bytes in parallel.
The Parser, on the other hand, holds all the context. It maintains a State that evolves as it consumes tokens. This is where the complexity lives, as Markdown is context-sensitive.
The Parser Loop
Here is the core loop from src/parser.zig. Notice how it takes a token from the lexer, passes it to the DFA transition function (DFA.f), and checks if the state has reached a Done condition to emit a Block (like a Paragraph or Title).
pub fn next(self: *Self, lexer: *Lexer) ParseError!?Block {
    // 1. Initialize or recover state
    var state = if (self.recover_state) |state|
        State{ .state = state, .allocator = self.allocator }
    else
        State.empty(self.allocator);

    // 2. Loop through tokens
    while (lexer.next()) |value| {
        switch (value.item) {
            .ok => |token| {
                // 3. Apply the transition function: f(State, Token) -> NewState
                try DFA.f(&state, token, value.span);

                // 4. Check whether a Block is complete
                switch (state.state) {
                    .Done => {
                        self.recover_state = state.recover_state;
                        return state.value;
                    },
                    else => continue,
                }
            },
            // ... error handling ...
        }
    }
    return null;
}
By decoupling the token generation from the semantic interpretation, we make the system easier to test and optimize. The Lexer is just a fancy iterator, while the Parser contains the pure state machine logic.
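Wired together, the whole pipeline is one loop. Here is a minimal driver sketch, assuming hypothetical Lexer.init / Parser.init constructors and a render sink; only parser.next matches the code above:

fn renderDocument(allocator: std.mem.Allocator, source: []const u8) !void {
    var lexer = Lexer.init(source); // hypothetical constructor
    var parser = Parser.init(allocator); // hypothetical constructor

    // Pull blocks one at a time until the token stream is exhausted.
    while (try parser.next(&lexer)) |block| {
        render(block); // hypothetical sink: emit HTML, collect, etc.
    }
}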
Pull vs Push Architecture
The architecture described above is a Pull-based system. The Parser controls the flow by calling lexer.next() whenever it is ready to process a new token.
In a Push-based system, the Lexer would drive the process, reading the file continuously and “pushing” tokens into the Parser (e.g., via callbacks or an event queue).
We chose the Pull model because:
- Parser-Centric Control: The Parser is the complex part. Giving it control means it can easily pause, look ahead, or stop after parsing a single block (as seen in Parser.next returning ?Block).
- Zero Overhead: Zig excels at simple, synchronous control flow. A Pull lexer is just a function call. A Push lexer often implies inversion of control (callbacks) or concurrency overhead (channels), which adds unnecessary complexity for a single-threaded parser. The two styles are contrasted in the sketch after this list.
- State Management: In Markdown, the boundary between blocks isn’t always clear until you parse the content. The Pull model allows the Parser to maintain its state stack and effectively “pull” the structure out of the flat stream of tokens.
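The contrast is easiest to see side by side. A minimal sketch; parser.consume is an invented callback, while parser.next is the real entry point shown earlier:

// Pull (what mdz does): the Parser owns the loop and calls
// lexer.next() whenever it is ready for more input.
while (try parser.next(&lexer)) |block| {
    handle(block); // hypothetical sink
}

// Push (the road not taken): the Lexer owns the loop and hands each
// token to the Parser via a callback, inverting control.
while (lexer.next()) |token| {
    try parser.consume(token); // `consume` is invented for illustration
}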
AST Representation: BitFlags vs. Trees
Most parsers represent Markdown as a tree of nodes (e.g., a Strong node containing a Text node). While expressive, this involves significant pointer chasing and memory allocation.
mdz takes a different approach for inline formatting. Instead of a tree, we use BitFlags.
We define a set of decorations:
pub const TextKind = enum(u8) {
    Bold,
    Italic,
    Strikethrough,
    LaTex,
    Code,
};

pub const Decorations = std.bit_set.StaticBitSet(5);

pub const Text = struct {
    decorations: Decorations, // e.g., { Bold, Italic }
    span: Span,
};
When the parser encounters **bold *italic***, it doesn’t build a nested structure. Instead, it flattens the text into spans with attributes:
"bold "->[Bold]"italic"->[Bold, Italic]
Trade-offs:
- Traditional Tree AST: Good for complex nesting where parent nodes need specific attributes. Heavy on memory.
- Flat AST: Better cache locality, but traversing structure is still complex.
- BitFlags (mdz): Extremely compact and cache-friendly. Text nodes are just a Span (2 integers) and a BitSet. However, it cannot easily represent complex, non-hierarchical overlapping (which Markdown avoids) or attributes on styles (e.g., a span with a class), though standard Markdown doesn’t strictly require that for basic styles.
This representation drastically reduces the memory footprint and simplifies the rendering logic (just check the bits and emit tags).
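To make this concrete, here is a self-contained sketch of setting and testing decoration bits. The HTML tags are illustrative; mdz’s actual renderer may differ:

const std = @import("std");

const TextKind = enum(u8) { Bold, Italic, Strikethrough, LaTex, Code };
const Decorations = std.bit_set.StaticBitSet(5);

pub fn main() void {
    // "italic" inside "**bold *italic***" carries both flags at once.
    var deco = Decorations.initEmpty();
    deco.set(@intFromEnum(TextKind.Bold));
    deco.set(@intFromEnum(TextKind.Italic));

    // Rendering is just bit tests; no tree traversal needed.
    if (deco.isSet(@intFromEnum(TextKind.Bold))) std.debug.print("<strong>", .{});
    if (deco.isSet(@intFromEnum(TextKind.Italic))) std.debug.print("<em>", .{});
}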
The Lexer: From Bytes to Tokens
The first step is to tokenize the input. Instead of working with raw characters in the parser, we categorize them into tokens. This simplifies the parser’s job.
In src/lexer.zig, we define the tokens. Zig’s union(enum) is perfect for this, as it allows tokens to carry data (payloads) only when necessary.
pub const TokenItemTag = enum {
    Tab,
    Space,
    LineEnd,
    Sign,
    AsciiNumber,
    Str,
    EOF,
};

pub const TokenItem = union(TokenItemTag) {
    const Self = @This();

    Tab: void,
    Space: void,
    LineEnd: void,
    Sign: u8,
    // Using slice to represent number content
    AsciiNumber: []const u8,
    Str: []const u8,
    EOF: void,

    // ... constructors (e.g., TokenItem.sign('#'))
};
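Constructing and matching such a token looks like this. A small sketch using standard union-literal syntax (the helper constructors are only hinted at in the comment above, so they are not used here):

const tok = TokenItem{ .Sign = '#' };
switch (tok) {
    .Sign => |c| std.debug.print("sign: {c}\n", .{c}), // payload travels with the tag
    .Str, .AsciiNumber => |s| std.debug.print("text: {s}\n", .{s}),
    else => {}, // Tab, Space, LineEnd, EOF carry no payload
}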
The Lexer struct holds the state of the text traversal. We use std.unicode.Utf8Iterator to handle UTF-8 decoding seamlessly.
pub const Lexer = struct {
    const Self = @This();

    buffer: []const u8,
    index: usize = 0,
    utf8_iterator: UTF8Iterator,

    pub fn next(self: *Self) ?Token {
        // ... check buffer bounds ...
        const token = if (self.next_code()) |code|
            switch (code) {
                0x9 => self.tab(),
                '\n' => self.lineEnd(false),
                '0'...'9' => {
                    // ... consume the full number ...
                },
                // Handle markdown special characters
                0x21...0x2F, 0x3A...0x40, 0x5B...0x60, 0x7B...0x7E => |c| self.sign(@intCast(c)),
                else => {
                    // ... consume a normal string ...
                },
            }
        else
            null;
        // ... update index ...
        return token;
    }
};
This separation allows the parser to focus on structure rather than byte-level decoding.
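As a rough usage sketch (Lexer.init is a hypothetical constructor; the real one may differ), tokenizing "# Hi\n" would yield Sign('#'), Space, Str("Hi"), LineEnd:

var lexer = Lexer.init("# Hi\n"); // hypothetical constructor
while (lexer.next()) |token| {
    std.debug.print("{any}\n", .{token.item});
}
// Expected token stream: Sign('#'), Space, Str("Hi"), LineEnd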
The Parser: A State Machine (DFA)
The core logic of our parser lies in the State Machine. Markdown parsing depends heavily on “what we have seen so far”. For example, a # might be a header, or it might be just text inside a code block.
We define a StateKind enum and a corresponding StateItem union to hold context-specific data.
pub const StateKind = enum {
    Empty,
    MaybeTitle,
    MaybeTitleContent,
    MaybeCodeBlock,
    MaybeThematicBreak,
    NormalText,
    Done,
    // ... and many more
};

pub const StateItem = union(StateKind) {
    Empty: void,
    // For titles, we need to know the level (number of #)
    MaybeTitle: usize,
    MaybeTitleContent: usize,
    // For text, we track the span
    NormalText: Span,
    Done: void,
    // ... helpers
};
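Because payloads live inside the union, a transition can update context in place instead of allocating a new state object. A small sketch using only the variants defined above:

var s = StateItem{ .MaybeTitle = 1 }; // saw one '#'
switch (s) {
    // A second '#' just bumps the stored level in place.
    .MaybeTitle => |*level| level.* += 1,
    else => {},
}
// s is now MaybeTitle(2)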
The Transition Function
The heart of a DFA is the transition function: (CurrentState, Input) -> NextState. In our implementation (src/dfa/lib.zig), this is the f function.
pub fn f(s: *State, token: Token, span: Span) ParseError!ReturnType {
    try switch (token) {
        .Sign => |sign| switch (sign) {
            '`' => Sign.code(s, span),
            '#' => Sign.title(s, span),
            '*' => Sign.boldItalic(s, span),
            // ... dispatch to specific handlers
            else => Sign.normal(s, span),
        },
        .Tab => Ends.tab(s, span),
        .Space => Ends.space(s, span),
        .LineEnd => Ends.lineEnd(s, span),
        .Str => |str| unicodes(s, str, span),
        // ...
    };
}
Zig’s switch statements ensure that we handle every token type. If we add a new token type, the compiler will force us to update this logic.
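This safety is easy to demonstrate. The following switch over TokenItemTag compiles only while every variant is covered; add a variant to the enum and compilation fails right here (a sketch, not mdz code):

fn describe(tag: TokenItemTag) []const u8 {
    return switch (tag) { // no `else`: the compiler demands all cases
        .Tab => "tab",
        .Space => "space",
        .LineEnd => "line end",
        .Sign => "sign",
        .AsciiNumber => "number",
        .Str => "string",
        .EOF => "end of file",
    };
}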
Example: Parsing Titles
Let’s look at how we handle titles (# Title) in src/dfa/sign/title.zig.
When the parser encounters a # token, it calls Sign.title. Inside, we check the current state to decide what to do.
pub fn title(state: *State, span: Span) Error!void {
    switch (state.state) {
        // If we are at the start of a block
        .Empty => {
            // Transition to MaybeTitle state with count 1
            state.state = .{ .MaybeTitle = 1 };
        },
        // If we are already parsing a title prefix (e.g., "##")
        .MaybeTitle => |*size| {
            // Increment the level
            size.* += 1;
        },
        // If we are inside a code block
        .MaybeFencedCodeBegin => |s| {
            // Treat # as normal text/code
            try state.value.?.Paragraph.addCode(state.allocator, ...);
            state.toNormalText(span);
        },
        // Default fallback: treat as normal text
        .NormalText => |*s| {
            _ = s.enlarge(span.len);
        },
        // ...
    }
}
This pattern matches the DFA formal definition perfectly. We check the current state (e.g., MaybeTitle) and the input (implied by the function being title), and we transition to a new state or update the current one (accumulating #s).
Step-by-Step Trace: # Hello
To see it in action, let’s trace the full parsing lifecycle of the string # Hello.
1. Start:
   - State: Empty
   - Value: null

2. Lexer emits Sign('#'):
   - Action: Sign.title(Empty)
   - Transition: Empty -> MaybeTitle(1)
   - Meaning: We found one hash; it might be a title.

3. Lexer emits Space:
   - Action: Ends.space(MaybeTitle(1))
   - Transition: MaybeTitle(1) -> MaybeTitleContent(1)
   - Meaning: The hashes ended with a space, so this is a valid H1 title.

4. Lexer emits Str("Hello"):
   - Action: unicodes(MaybeTitleContent(1))
   - Effect: state.value is initialized as Block.Title(level=1).
   - Transition: MaybeTitleContent -> NormalText
   - Meaning: We are now parsing the text content of the title.

5. Lexer emits LineEnd:
   - Action: Ends.lineEnd(NormalText)
   - Transition: NormalText -> Done
   - Effect: The parser detects the block is complete and returns the Block.
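A test for the trace above could look roughly like this. Everything here is hypothetical: the constructors and the Block field names are assumptions based on the trace, not mdz’s actual API:

test "# Hello parses as a level-1 title" {
    var lexer = Lexer.init("# Hello\n"); // hypothetical constructor
    var parser = Parser.init(std.testing.allocator); // hypothetical constructor
    const block = (try parser.next(&lexer)).?;
    // Field names are assumptions based on the trace above.
    try std.testing.expectEqual(@as(usize, 1), block.Title.level);
}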
Edge Case Trace: #Invalid
What if the user types #Invalid (no space)?
1. Start: Empty.

2. Lexer emits Sign('#'):
   - Action: Sign.title(Empty)
   - Transition: Empty -> MaybeTitle(1)

3. Lexer emits Str("Invalid"):
   - Action: unicodes(MaybeTitle(1))
   - Logic: The parser sees text immediately after # without a space, which violates the Markdown spec for headers.
   - Transition: MaybeTitle(1) -> NormalText
   - Recovery: The parser effectively “rewinds” by creating a NormalText span that includes both the previous # and the current “Invalid” string. It now treats the line as a Paragraph starting with #.
This state recovery is crucial. We didn’t need to backtrack the Lexer; the Parser simply reinterpreted the meaning of the previous token based on the current one.
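In code, this recovery amounts to one more switch arm inside the unicodes handler. A sketch only; growBack is a hypothetical helper for widening a span backwards, and the real code lives in src/dfa/:

// Sketch: inside unicodes(), text arrived while still in MaybeTitle.
.MaybeTitle => |level| {
    // The `level` '#' bytes just before this token are plain text
    // after all, so widen the span backwards to cover them and fall
    // into plain paragraph parsing. (`growBack` is hypothetical.)
    state.toNormalText(span.growBack(level));
},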
Why Zig?
- Tagged Unions: As seen in TokenItem and StateItem, Zig allows us to associate data with enum variants safely. This eliminates the need for void* casting or the complex inheritance hierarchies often found in C++ parsers.
- Control Flow: Zig’s explicit control flow and lack of hidden allocations make it easy to reason about performance.
- Error Handling: The try keyword and error sets make handling unexpected tokens or structural errors consistent (see the sketch after this list).
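As a small illustration of the last point, an error set plus try keeps every failure path explicit and uniform. This is a generic sketch reusing the TokenItemTag from the lexer section, not mdz’s actual error set:

const std = @import("std");

const ParseError = error{UnexpectedToken};

fn expectSign(tag: TokenItemTag) ParseError!void {
    if (tag != .Sign) return error.UnexpectedToken;
}

test "try propagates parse errors" {
    try expectSign(.Sign);
    try std.testing.expectError(error.UnexpectedToken, expectSign(.Space));
}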
Conclusion
Writing a parser in Zig is a rewarding experience. The language features align well with the theoretical concepts of automata theory. By explicitly defining states and transitions, we create a parser that is both easy to extend and performant.
The mdz project demonstrates that you don’t always need parser generators (like Yacc/Bison) or complex parser combinators. A hand-written DFA in Zig is readable, type-safe, and fast.