Rowles.LeanLucene.Codecs.Fst
Classes
FSTAutomaton
Provides automaton intersection with the term dictionary. Walks sorted terms while simultaneously advancing the automaton, pruning branches where CanMatch(int) returns false.
FSTBuilder
Builds a minimal acyclic finite state transducer (FST) from sorted byte-sequence keys with associated long outputs. Uses suffix-sharing (Daciuk et al.) to compress the automaton: identical sub-graphs are stored only once.
Usage:
var builder = new FSTBuilder(); builder.Add(key1Bytes, output1); builder.Add(key2Bytes, output2); // keys MUST be in lexicographic byte order byte[] fstData = builder.Finish();The returned
byte[]can be written to disk and decoded byFSTReader.Serialised format (documented for FSTReader):
-
Header (always present):
- 4 bytes — magic: ASCII
FST1(0x46 0x53 0x54 0x31) - VarInt — root node address (absolute position in the node-data section, or
-1if empty) - VarInt — key count
- 4 bytes — magic: ASCII
- Node data: Nodes are written into a contiguous byte buffer. Each node is serialised as a sequence of arcs. Nodes are appended in the order they are compiled (deepest-suffix-first). The root node is the last to be compiled, so its address is stored in the header.
-
Arc layout (each arc):
- 1 byte — flags:
[bit 7: isFinal][bit 6: isLastArc][bit 5: hasOutput][bit 4: hasTarget][bits 3-0: reserved] - 1 byte — label (the byte value on this transition; 0x00 for virtual arcs on final-only nodes)
- If
hasTarget: VarInt-encoded target address (absolute position in node-data section) - If
hasOutput: VarInt-encoded output value (long, 7-bit variable-length, unsigned encoding)
- 1 byte — flags:
-
The
isLastArcflag marks the final arc of a node (arcs are written consecutively; the reader scans until it encountersisLastArc == true). -
Final-only nodes (accept states with no outgoing arcs) are encoded as a single
virtual arc with label
0x00, flagsisFinal | isLastArc, and an optional output (the final output). -
Final flag on arcs: when a node is both final (accept state) and has outgoing arcs,
the
isFinalflag is set on the first arc. The node'sFinalOutputis encoded as an additional virtual arc (label0xFF) appended before the real arcs when the final output is non-zero. If the final output is zero, only the flag on the first real arc signals acceptance.
VarInt encoding: 7-bit variable-length, little-endian, compatible with System.IO.BinaryWriter.Write7BitEncodedInt64(System.Int64). Each byte stores 7 data bits; the high bit (
0x80) is a continuation flag.Algorithm — incremental construction (Daciuk et al.):
- Maintain a "frontier" — one FSTBuilder.UncompiledNode per byte of the current key prefix.
- When adding a new key, find the common prefix length with the previous key.
- For each frontier node beyond the common prefix, "freeze" (compile) it: check a registry of previously frozen nodes for an equivalent node (same arcs, outputs, final flag). If found, reuse that address (suffix sharing). Otherwise, serialise the node to the output buffer and register it.
- Extend the frontier for the new key's suffix.
- Distribute the output along arcs: push as much output as possible onto earlier (leftmost) arcs; differences are pushed down to child arcs.
- Finish() freezes remaining frontier nodes and returns the serialised FST.
-
Header (always present):
FSTReader
Reads a v2 .dic file: compact byte-keyed sorted dictionary. All data is loaded into contiguous arrays at open time — no per-term string allocation. Binary search operates on raw UTF-8 bytes (~3× faster than char-span comparison).
LevenshteinAutomaton
Levenshtein DFA accepting strings within edit distance
maxEditsof a reference term. Built via NFA-to-DFA subset construction at construction time to correctly handle deletions (ε-transitions that advance position without consuming input). Operates on UTF-8 bytes.
PrefixAutomaton
Accepts any byte sequence that starts with a given UTF-8 prefix. States: 0..prefix.Length-1 = matching prefix bytes, prefix.Length = accepted (sink).
TermDictionaryFstBuilder
Serialises a sorted term list into a compact byte-keyed v2 dictionary format. Format: [termCount:int32][postingsOffsets: N*int64][keyStarts: (N+1)*int32][keyData: UTF-8 bytes]. This format enables O(log N) binary search on raw UTF-8 bytes without string materialisation. Terms are re-sorted in UTF-8 byte order to ensure binary search correctness (string ordinal sort can differ from UTF-8 byte sort for supplementary characters).
WildcardAutomaton
DFA for wildcard patterns with '*' (any sequence) and '?' (any single byte). Built via NFA-to-DFA subset construction at construction time. Operates on individual UTF-8 bytes (not characters).
Interfaces
IAutomaton
Interface for deterministic finite automata used with term dictionary intersection. States are represented as integers. State -1 is the dead/reject state.