TokenIterators.jl provides a simple syntax for writing lexers/tokenizers, along with a few built-in tokenizers. It is very fast (zero allocations while iterating; see the benchmark below) and easy to use.
> [!IMPORTANT]
> - This package is not designed for validating syntax.
> - There's no guarantee that a token is the "smallest meaningful unit".
```julia
using TokenIterators

t = JSONTokens(b"""{ "key": "value", "key2": -1e-7}""")

collect(t)
```

A `TokenIterator` (abstract type) iterates over `Token`s (the smallest meaningful units of text/data) from any input `data::T` where `T <: AbstractVector{UInt8}`. Both `TokenIterator{T,K,S}` and `Token{T,K,S}` are parameterized by the input data type `T`, the token kind type `K`, and the state type `S`.
```julia
struct Token{T <: AbstractVector{UInt8}, K, S} <: AbstractVector{UInt8}
    data::T   # Input data
    kind::K   # Kind of token (e.g. Symbol or an Enum type)
    i::Int    # First index of token
    j::Int    # Last index of token
    state::S  # Any additional state we wish to store
end
```
> [!TIP]
> StringViews.jl can be used to provide lightweight `AbstractString` views of a token.
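Since a `Token` is itself an `AbstractVector{UInt8}`, wrapping it should work directly. A minimal sketch, assuming `Token` supports the standard `AbstractVector` interface:

```julia
using TokenIterators, StringViews

t = JSONTokens(b"""{"key": "value"}""")
for tok in t
    # StringView wraps the token's bytes without copying them.
    println(tok.kind, " => ", StringView(tok))
end
```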
A `Token` can be created by a `Rule`, which is created with the syntax:

```julia
starting_pattern --> ending_pattern
```

where:

- `starting_pattern` is used to identify the beginning of a token.
- `ending_pattern` is used to identify the end of a token.
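For example, the string and number rules used by the built-in `JSONTokens` (shown in full below) read as follows. A sketch, assuming `-->`, `Unescaped`, and `≺` are exported for standalone use:

```julia
using TokenIterators

# Token starts at '"' and ends at the next '"' that is not escaped:
string_rule = '"' --> Unescaped('"')

# Token starts at a sign or digit and ends just before the first byte
# that cannot appear in a JSON number:
number_rule = ∈(b"-0123456789") --> ≺(∉(b"-+eE.0123456789"))
```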
> [!NOTE]
> A `Token` with indexes `i` and `j` will be created if `starting_pattern` matches at `data[i]` and `j == findnext(ending_pattern, data, i + 1)`. We use `TokenIterators.isfirst(starting_pattern, dataview)` and `TokenIterators._findnext(ending_pattern, dataview, i + 1)` (to avoid type piracy with `Base.findnext`) to determine the `i` and `j` of a `Token`, where `dataview` is a view of the data after the previous token, e.g. `@view data[prev_token.j + 1:end]`.
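To make the mechanics concrete, here is the same idea written with plain `Base` functions (an illustration of the logic only, not the package's internals):

```julia
data = codeunits("\"key\": 1")

# Rule: '"' --> '"'  (ignoring escapes for simplicity)
i = 1                                      # data[i] matches the starting pattern '"'
j = findnext(==(UInt8('"')), data, i + 1)  # ending pattern matches at index 5
String(data[i:j])                          # "\"key\""
```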
Here is the full implementation of `TokenIterators.JSONTokens`:
```julia
struct JSONTokens{T <: Data} <: TokenIterator{T, Symbol, Nothing}
    data::T
end

next(o::JSONTokens, n::Token) = @tryrules begin
    curly_open = '{' --> 1
    curly_close = '}' --> 1
    square_open = '[' --> 1
    square_close = ']' --> 1
    comma = ',' --> 1
    colon = ':' --> 1
    var"true" = 't' --> 4
    var"false" = 'f' --> 5
    var"null" = 'n' --> 4
    string = STRING                # '"' --> Unescaped('"')
    number = NUMBER                # ∈(b"-0123456789") --> ≺(∉(b"-+eE.0123456789"))
    whitespace = ASCII_WHITESPACE  # ∈(b" \t\r\n") --> ≺(∉(b" \t\r\n"))
end
```
- `next` is the core part of `Base.iterate(::TokenIterator)`.
- Here `n::Token` is a view into the data after the previous token, e.g. `Token(prev.data, prev.kind, prev.j + 1, length(prev.data), prev.state)`.
- The `@tryrules` macro attempts to match the rules in order. If a rule matches, the token is created and the function returns.
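As a worked example of the same pattern, here is what a tokenizer for comma-separated fields might look like. A sketch under assumptions: `CSVTokens` is a hypothetical name, `TokenIterators.Data` is the same input alias used by `JSONTokens`, and `≺` is exported:

```julia
using TokenIterators
import TokenIterators: next

# Hypothetical tokenizer for comma/newline-separated fields.
struct CSVTokens{T <: TokenIterators.Data} <: TokenIterator{T, Symbol, Nothing}
    data::T
end

next(o::CSVTokens, n::Token) = @tryrules begin
    comma   = ',' --> 1
    newline = '\n' --> 1
    # A field starts at any non-separator byte and ends just before
    # the next separator:
    field   = ∉(b",\n") --> ≺(∈(b",\n"))
end
```

With these three rules, `collect(CSVTokens(b"a,b\nc"))` should alternate between `field` tokens and separator tokens.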
Patterns in the DSL determine how the token is identified in `data::AbstractVector{UInt8}`.
**Starting patterns:**

| Type | Example | Symbol | Tab-Completion | Description |
|---|---|---|---|---|
| `UInt8` | `0x7b` | | | `x == data[i]` |
| `Char` | `'{'` | | | `UInt8(x) == data[i]` |
| `Function` | `∈(b" \t\r\n")` | | | `f(data[1]) == true` |
| `Tuple` | `(a,b)` | | | `data` starts with the sequence of patterns |
| `AbstractVector{UInt8}` | `b"null"` | | | `x == data[1:length(x)]` |
| `UseStringView` | `𝑠('😄')` | `𝑠` | `\its` | Use pattern with `data::StringView` |
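For instance, each of the following starting patterns would match the beginning of `b"null"` (illustrative values only):

```julia
# Illustrative starting patterns that all match the beginning of b"null":
0x6e                 # UInt8: the first byte equals 0x6e ('n')
'n'                  # Char: compared as UInt8('n')
∈(b"nft")            # Function: returns true for the first byte
('n', b"ull")        # Tuple: a sequence of patterns
b"null"              # AbstractVector{UInt8}: data begins with these bytes
```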
**Ending patterns:**

| Type | Example | Symbol | Tab-Completion | Description |
|---|---|---|---|---|
| `UInt8` | `0x7b` | | | `findnext(==(x), data, i + 1)` |
| `Char` | `'{'` | | | `findnext(==(UInt8(x)), data, i + 1)` |
| `Function` | `∈(b" \t\r\n")` | | | `findnext(f, data, i + 1)` |
| `Int` | `1` | | | Return `x` (fixed length) |
| `Tuple` | `(a,b)` | | | Return the range in which indexes of `data` match the sequence of patterns |
| `Last` | `Last((a,b))` | `→` | `\rightarrow` | Return the last index of the matching indexes |
| `First` | `First((a,b))` | `←` | `\leftarrow` | Return the first index of the matching indexes |
| `Before` | `Before(x -> x == a)` | `≺` | `\prec` | Return the index before the match index |
| `UseStringView` | `𝑠(isascii)` | `𝑠` | `\its` | Use pattern with `data::StringView` |
| `Not` | `¬('\\')` | `¬` | `\neg` | Match anything but the given pattern |
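Putting a few of these together, a rule for a `#` line comment might read as follows. A sketch, assuming `≺` (`Before`) accepts any predicate as in the table above:

```julia
using TokenIterators

# Token starts at '#' and ends just before the next newline.
comment_rule = '#' --> ≺(==(UInt8('\n')))
```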
The `STRING`, `NUMBER`, and `ASCII_WHITESPACE` patterns from the `JSONTokens` example are interpreted as:

```julia
# STRING: Token begins with UInt8('"') and ends at the next UInt8('"') (excluding escaped '"')
'"' --> Unescaped('"')

# NUMBER: Token begins with any byte in b"-0123456789" and ends at the index
# before the first byte that is not in b"-+eE.0123456789"
∈(b"-0123456789") --> ≺(∉(b"-+eE.0123456789"))

# ASCII_WHITESPACE: Token begins with any byte in b" \t\r\n" and ends at the
# index before the first byte that is not in b" \t\r\n"
∈(b" \t\r\n") --> ≺(∉(b" \t\r\n"))
```
TokenIterators is very fast with minimal allocations:
```julia
using TokenIterators, BenchmarkTools

versioninfo()
# Julia Version 1.11.3
# Commit d63adeda50d (2025-01-21 19:42 UTC)
# Build Info:
#   Official https://julialang.org/ release
# Platform Info:
#   OS: macOS (arm64-apple-darwin24.0.0)
#   CPU: 10 × Apple M1 Pro
#   WORD_SIZE: 64
#   LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
# Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)

data = read(download("https://github.com/plotly/plotly.js/raw/v3.0.1/dist/plot-schema.json"));

Base.format_bytes(length(data))
# "3.648 MiB"

t = JSONTokens(data)
# JSONTokens (3824728-element Vector{UInt8})

@benchmark sum(t.kind == :string for t in $t)
# BenchmarkTools.Trial: 497 samples with 1 evaluation per sample.
#  Range (min … max):  9.834 ms … 31.878 ms    ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     10.001 ms               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   10.059 ms ± 993.690 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
#
#    ▁▁▄▄▂▂▄▁ ▃▄ ▂▄▅█▁▄▃ ▄▃▃▆▂▁▂▂
#  ▆▃████████▆▆▅▆██████████▆█████████▆▅▆▆▆▆▅▆▃▃▄▃▃▅▄▃▅▁▃▃▃▃▃▁▃▃ ▅
#  9.83 ms         Histogram: frequency by time        10.3 ms <
#
#  Memory estimate: 0 bytes, allocs estimate: 0.
```