joshday/TokenIterators.jl

TokenIterators

TokenIterators.jl provides a simple syntax for writing lexers/tokenizers, along with a few built-in tokenizers. It's fast (see Performance below) and easy to use.

Important

  • This package is not designed for validating syntax.
  • There is no guarantee that a token is the "smallest meaningful unit".

Usage

using TokenIterators

t = JSONTokens(b"""{ "key": "value", "key2": -1e-7}""")

collect(t)
(image: TokenIterators.jl example — output of collect(t))

TokenIterator and Token

A TokenIterator (abstract type) iterates over Tokens (smallest meaningful unit of text/data) from any input T::AbstractVector{UInt8}.

Both TokenIterator{T,K,S} and Token{T,K,S} are parameterized by the input data T, type of token kind K, and type of state S.

struct Token{T <: AbstractVector{UInt8}, K, S} <: AbstractVector{UInt8}
    data::T     # Input Data
    kind::K     # Kind of Token (e.g. Symbol or an Enum type)
    i::Int      # first index of token
    j::Int      # last index of token
    state::S    # Any additional state we wish to store
end

Tip

StringViews.jl can be used to provide lightweight AbstractString views of the token.

Rules

A Token can be created by a Rule, which is created with the syntax:

starting_pattern --> ending_pattern

where:

  1. starting_pattern is used to identify the beginning of a token.
  2. ending_pattern is used to identify the end of a token.

Note

A Token with indexes i and j is created when the starting pattern matches at i and the ending pattern matches at j. Internally, TokenIterators.isfirst(starting_pattern, dataview) tests the start and TokenIterators._findnext(ending_pattern, dataview, i + 1) finds the end (a separate function is used to avoid type piracy on Base.findnext), where dataview is a view of the data after the previous token, e.g. @view data[prev_token.j + 1:end].
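To make the index bookkeeping concrete, here is a Base-only sketch (no package functions; `isfirst`/`_findnext` are replaced by a plain comparison and `Base.findnext`) of how `i` and `j` bound a string token:

```julia
data = Vector{UInt8}("""{ "key": 1 }""")

# Starting pattern: a byte equal to '"' marks the first index of the token.
i = findfirst(==(UInt8('"')), data)          # i == 3

# Ending pattern: the next '"' after i marks the last index of the token.
j = findnext(==(UInt8('"')), data, i + 1)    # j == 7

token = String(data[i:j])                    # "\"key\"" (quotes included)
```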

An Example: JSONTokens

  • Here is the full implementation of TokenIterators.JSONTokens:
struct JSONTokens{T <: Data} <: TokenIterator{T, Symbol, Nothing}
    data::T
end

next(o::JSONTokens, n::Token) = @tryrules begin
    curly_open      = '{' --> 1
    curly_close     = '}' --> 1
    square_open     = '[' --> 1
    square_close    = ']' --> 1
    comma           = ',' --> 1
    colon           = ':' --> 1
    var"true"       = 't' --> 4
    var"false"      = 'f' --> 5
    var"null"       = 'n' --> 4
    string          = STRING            # '"'                -->  Unescaped('"')
    number          = NUMBER            # ∈(b"-0123456789")  -->  ≺(∉(b"-+eE.0123456789"))
    whitespace      = ASCII_WHITESPACE  #  ∈(b" \t\r\n")     -->  ≺(∉(b" \t\r\n"))
end
  • next is the core part of Base.iterate(::TokenIterator).
  • Here n::Token is a view into the data after the previous token, e.g. Token(prev.data, prev.kind, prev.j + 1, length(prev.data), prev.state).
  • The @tryrules macro will attempt to match the rules in order. If a rule matches, the token is created and the function returns.
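The try-rules-in-order behavior can be imitated in plain Base Julia. The following hand-rolled sketch (the helper `toy_kinds` is hypothetical, not part of the package) mimics what `@tryrules` does for a subset of the JSON rules, trying each rule in order at the current position:

```julia
# Hypothetical, package-free imitation of @tryrules for a few JSON rules.
function toy_kinds(data::AbstractVector{UInt8})
    kinds = Symbol[]
    i = 1
    while i <= length(data)
        b = data[i]
        if b == UInt8('{')                     # curly_open = '{' --> 1
            j = i; push!(kinds, :curly_open)
        elseif b == UInt8('}')                 # curly_close = '}' --> 1
            j = i; push!(kinds, :curly_close)
        elseif b == UInt8(':')                 # colon = ':' --> 1
            j = i; push!(kinds, :colon)
        elseif b == UInt8('"')                 # string = '"' --> '"' (ignoring escapes)
            j = findnext(==(UInt8('"')), data, i + 1)
            push!(kinds, :string)
        else                                   # number: ends before first non-number byte
            stop = findnext(∉(b"-+eE.0123456789"), data, i + 1)
            j = something(stop, length(data) + 1) - 1
            push!(kinds, :number)
        end
        i = j + 1                              # resume after the token just emitted
    end
    kinds
end

toy_kinds(Vector{UInt8}("""{"a":1}"""))
# [:curly_open, :string, :colon, :number, :curly_close]
```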

Mini-DSL

Patterns in the DSL determine how the token is identified in data::AbstractVector{UInt8}.

Starting Patterns

| Type | Example | Symbol (Tab-Completion) | Description |
|------|---------|-------------------------|-------------|
| `UInt8` | `0x7b` | | `x == data[i]` |
| `Char` | `'{'` | | `UInt8(x) == data[i]` |
| `Function` | `∈(b" \t\r\n")` | | `f(data[1]) == true` |
| `Tuple` | `(a, b)` | | `data` starts with the sequence of patterns |
| `AbstractVector{UInt8}` | `b"null"` | | `x == data[1:length(x)]` |
| `UseStringView` | `𝑠('😄')` | `𝑠` (`\its`) | Use pattern with `data::StringView` |
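A quick Base-only check of two of these starting-pattern semantics (the `Function` and `AbstractVector{UInt8}` rows), using ordinary Julia rather than the package:

```julia
data = b"null, 1"                      # CodeUnits is an AbstractVector{UInt8}

f = ∈(b" \t\r\n")                      # Function pattern: applied to the first byte
f(data[1])                             # false: 'n' is not ASCII whitespace

x = b"null"                            # byte-vector pattern: prefix comparison
starts_null = data[1:length(x)] == x   # true
```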

Ending Patterns

| Type | Example | Symbol (Tab-Completion) | Description |
|------|---------|-------------------------|-------------|
| `UInt8` | `0x7b` | | `findnext(==(x), data, i + 1)` |
| `Char` | `'{'` | | `findnext(==(UInt8(x)), data, i + 1)` |
| `Function` | `∈(b" \t\r\n")` | | `findnext(f, data, i + 1)` |
| `Int` | `1` | | return `x` (fixed length) |
| `Tuple` | `(a, b)` | | return range in which indexes of `data` match the sequence of patterns |
| `Last` | `Last((a, b))` | `→` (`\rightarrow`) | return the last index of the match indexes |
| `First` | `First((a, b))` | `←` (`\leftarrow`) | return the first index of the match indexes |
| `Before` | `Before(x -> x == a)` | `≺` (`\prec`) | return the index before the match index |
| `UseStringView` | `𝑠(isascii)` | `𝑠` (`\its`) | Use pattern with `data::StringView` |
| `Not` | `¬('\\')` | `¬` (`\neg`) | match anything but the given pattern |
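And the same kind of Base-only check for three ending-pattern semantics: a fixed length (`Int`), a `Function` via `findnext`, and the index-before behavior of `Before`/`≺`:

```julia
data = b"true, false"
i = 1

j = i + 4 - 1                              # Int pattern 4: token has fixed length 4
word = String(data[i:j])                   # "true"

stop = findnext(∉(b"true"), data, i + 1)   # Function pattern: first byte not in b"true"
j_before = stop - 1                        # Before / ≺: end just before that byte
```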

Mini-DSL Examples

The STRING, NUMBER, and ASCII_WHITESPACE patterns from the JSONTokens example are interpreted as:

# STRING: Token begins with UInt8('"') and ends at UInt8('"') (excluding escaped '"')
'"' --> Unescaped('"')

# NUMBER: Token begins with any byte in b"-0123456789" and ends at the index before the first byte that is not in b"-+eE.0123456789"
∈(b"-0123456789")  -->  ≺(∉(b"-+eE.0123456789"))

# ASCII_WHITESPACE: Token begins with any byte in b" \t\r\n" and ends at the index before the first byte that is not in b" \t\r\n"
∈(b" \t\r\n")  -->  ≺(∉(b" \t\r\n"))
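For instance, the NUMBER rule's `≺(∉(...))` ending pattern can be spelled out with `Base.findnext` (a Base-only sketch, not the package's implementation):

```julia
data = Vector{UInt8}("-1e-7}")

i = 1                                    # '-' is in b"-0123456789", so a number starts here
stop = findnext(∉(b"-+eE.0123456789"), data, i + 1)  # first byte outside the number alphabet
j = stop - 1                             # ≺ / Before: the token ends just before it

number = String(data[i:j])               # "-1e-7"
```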

Performance

TokenIterators is very fast with minimal allocations:

using TokenIterators, BenchmarkTools

versioninfo()
# Julia Version 1.11.3
# Commit d63adeda50d (2025-01-21 19:42 UTC)
# Build Info:
#   Official https://julialang.org/ release
# Platform Info:
#   OS: macOS (arm64-apple-darwin24.0.0)
#   CPU: 10 × Apple M1 Pro
#   WORD_SIZE: 64
#   LLVM: libLLVM-16.0.6 (ORCJIT, apple-m1)
# Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)

data = read(download("https://github.com/plotly/plotly.js/raw/v3.0.1/dist/plot-schema.json"));

Base.format_bytes(length(data))
# "3.648 MiB"

t = JSONTokens(data)
# JSONTokens (3824728-element Vector{UInt8})

@benchmark sum(t.kind == :string for t in $t)
# BenchmarkTools.Trial: 497 samples with 1 evaluation per sample.
#  Range (min … max):   9.834 ms …  31.878 ms  ┊ GC (min … max): 0.00% … 0.00%
#  Time  (median):     10.001 ms               ┊ GC (median):    0.00%
#  Time  (mean ± σ):   10.059 ms ± 993.690 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

#     ▁▁▄▄▂▂▄▁    ▃▄ ▂▄▅█▁▄▃ ▄▃▃▆▂▁▂▂
#   ▆▃████████▆▆▅▆██████████▆█████████▆▅▆▆▆▆▅▆▃▃▄▃▃▅▄▃▅▁▃▃▃▃▃▁▃▃ ▅
#   9.83 ms         Histogram: frequency by time         10.3 ms <

#  Memory estimate: 0 bytes, allocs estimate: 0.