Table of Contents

Public namespace Rowles.LeanLucene.Codecs.Fst

Classes

Internal classInternal FSTAutomaton

Provides automaton intersection with the term dictionary. Walks sorted terms while simultaneously advancing the automaton, pruning branches where CanMatch(int) returns false.

Public class FSTBuilder

Builds a minimal acyclic finite state transducer (FST) from sorted byte-sequence keys with associated long outputs. Uses suffix-sharing (Daciuk et al.) to compress the automaton: identical sub-graphs are stored only once.

Usage:

var builder = new FSTBuilder();
builder.Add(key1Bytes, output1);
builder.Add(key2Bytes, output2);  // keys MUST be in lexicographic byte order
byte[] fstData = builder.Finish();

The returned byte[] can be written to disk and decoded by FSTReader.

Serialised format (documented for FSTReader):

  • Header (always present):
    1. 4 bytes — magic: ASCII FST1 (0x46 0x53 0x54 0x31)
    2. VarInt — root node address (absolute position in the node-data section, or -1 if empty)
    3. VarInt — key count
  • Node data: Nodes are written into a contiguous byte buffer. Each node is serialised as a sequence of arcs. Nodes are appended in the order they are compiled (deepest-suffix-first). The root node is the last to be compiled, so its address is stored in the header.
  • Arc layout (each arc):
    1. 1 byte — flags: [bit 7: isFinal][bit 6: isLastArc][bit 5: hasOutput][bit 4: hasTarget][bits 3-0: reserved]
    2. 1 byte — label (the byte value on this transition; 0x00 for virtual arcs on final-only nodes)
    3. If hasTarget: VarInt-encoded target address (absolute position in node-data section)
    4. If hasOutput: VarInt-encoded output value (long, 7-bit variable-length, unsigned encoding)
  • The isLastArc flag marks the final arc of a node (arcs are written consecutively; the reader scans until it encounters isLastArc == true).
  • Final-only nodes (accept states with no outgoing arcs) are encoded as a single virtual arc with label 0x00, flags isFinal | isLastArc, and an optional output (the final output).
  • Final flag on arcs: when a node is both final (accept state) and has outgoing arcs, the isFinal flag is set on the first arc. The node's FinalOutput is encoded as an additional virtual arc (label 0xFF) appended before the real arcs when the final output is non-zero. If the final output is zero, only the flag on the first real arc signals acceptance.

VarInt encoding: 7-bit variable-length, little-endian, compatible with System.IO.BinaryWriter.Write7BitEncodedInt64(System.Int64). Each byte stores 7 data bits; the high bit (0x80) is a continuation flag.

Algorithm — incremental construction (Daciuk et al.):

  1. Maintain a "frontier" — one FSTBuilder.UncompiledNode per byte of the current key prefix.
  2. When adding a new key, find the common prefix length with the previous key.
  3. For each frontier node beyond the common prefix, "freeze" (compile) it: check a registry of previously frozen nodes for an equivalent node (same arcs, outputs, final flag). If found, reuse that address (suffix sharing). Otherwise, serialise the node to the output buffer and register it.
  4. Extend the frontier for the new key's suffix.
  5. Distribute the output along arcs: push as much output as possible onto earlier (leftmost) arcs; differences are pushed down to child arcs.
  6. Finish() freezes remaining frontier nodes and returns the serialised FST.
Internal classInternal FSTReader

Reads a v2 .dic file: compact byte-keyed sorted dictionary. All data is loaded into contiguous arrays at open time — no per-term string allocation. Binary search operates on raw UTF-8 bytes (~3× faster than char-span comparison).

Public class LevenshteinAutomaton

Levenshtein DFA accepting strings within edit distance maxEdits of a reference term. Built via NFA-to-DFA subset construction at construction time to correctly handle deletions (ε-transitions that advance position without consuming input). Operates on UTF-8 bytes.

Public class PrefixAutomaton

Accepts any byte sequence that starts with a given UTF-8 prefix. States: 0..prefix.Length-1 = matching prefix bytes, prefix.Length = accepted (sink).

Internal classInternal TermDictionaryFstBuilder

Serialises a sorted term list into a compact byte-keyed v2 dictionary format. Format: [termCount:int32][postingsOffsets: N*int64][keyStarts: (N+1)*int32][keyData: UTF-8 bytes]. This format enables O(log N) binary search on raw UTF-8 bytes without string materialisation. Terms are re-sorted in UTF-8 byte order to ensure binary search correctness (string ordinal sort can differ from UTF-8 byte sort for supplementary characters).

Public class WildcardAutomaton

DFA for wildcard patterns with '*' (any sequence) and '?' (any single byte). Built via NFA-to-DFA subset construction at construction time. Operates on individual UTF-8 bytes (not characters).

Interfaces

Public interface IAutomaton

Interface for deterministic finite automata used with term dictionary intersection. States are represented as integers. State -1 is the dead/reject state.