Table of Contents

Public classSealed FSTBuilder

Namespace
Rowles.LeanLucene.Codecs.Fst
Assembly
Rowles.LeanLucene.dll

Builds a minimal acyclic finite state transducer (FST) from sorted byte-sequence keys with associated long outputs. Uses suffix-sharing (Daciuk et al.) to compress the automaton: identical sub-graphs are stored only once.

Usage:

var builder = new FSTBuilder();
builder.Add(key1Bytes, output1);
builder.Add(key2Bytes, output2);  // keys MUST be in lexicographic byte order
byte[] fstData = builder.Finish();

The returned byte[] can be written to disk and decoded by FSTReader.

Serialised format (documented for FSTReader):

  • Header (always present):
    1. 4 bytes — magic: ASCII FST1 (0x46 0x53 0x54 0x31)
    2. VarInt — root node address (absolute position in the node-data section, or -1 if empty)
    3. VarInt — key count
  • Node data: Nodes are written into a contiguous byte buffer. Each node is serialised as a sequence of arcs. Nodes are appended in the order they are compiled (deepest-suffix-first). The root node is the last to be compiled, so its address is stored in the header.
  • Arc layout (each arc):
    1. 1 byte — flags: [bit 7: isFinal][bit 6: isLastArc][bit 5: hasOutput][bit 4: hasTarget][bits 3-0: reserved]
    2. 1 byte — label (the byte value on this transition; 0x00 for virtual arcs on final-only nodes)
    3. If hasTarget: VarInt-encoded target address (absolute position in node-data section)
    4. If hasOutput: VarInt-encoded output value (long, 7-bit variable-length, unsigned encoding)
  • The isLastArc flag marks the final arc of a node (arcs are written consecutively; the reader scans until it encounters isLastArc == true).
  • Final-only nodes (accept states with no outgoing arcs) are encoded as a single virtual arc with label 0x00, flags isFinal | isLastArc, and an optional output (the final output).
  • Final flag on arcs: when a node is both final (accept state) and has outgoing arcs, the isFinal flag is set on the first arc. The node's FinalOutput is encoded as an additional virtual arc (label 0xFF) appended before the real arcs when the final output is non-zero. If the final output is zero, only the flag on the first real arc signals acceptance.

VarInt encoding: 7-bit variable-length, little-endian, compatible with System.IO.BinaryWriter.Write7BitEncodedInt64(System.Int64). Each byte stores 7 data bits; the high bit (0x80) is a continuation flag.

Algorithm — incremental construction (Daciuk et al.):

  1. Maintain a "frontier" — one FSTBuilder.UncompiledNode per byte of the current key prefix.
  2. When adding a new key, find the common prefix length with the previous key.
  3. For each frontier node beyond the common prefix, "freeze" (compile) it: check a registry of previously frozen nodes for an equivalent node (same arcs, outputs, final flag). If found, reuse that address (suffix sharing). Otherwise, serialise the node to the output buffer and register it.
  4. Extend the frontier for the new key's suffix.
  5. Distribute the output along arcs: push as much output as possible onto earlier (leftmost) arcs; differences are pushed down to child arcs.
  6. Finish() freezes remaining frontier nodes and returns the serialised FST.
public sealed class FSTBuilder
FSTBuilder

Public constructor FSTBuilder(int)

Creates a new FST builder with the specified initial buffer capacity.

Internal fieldInternal FlagHasOutput

Internal fieldInternal FlagHasTarget

Internal fieldInternal FlagIsFinal

Internal fieldInternal FlagIsLastArc

Public property Count

Number of keys added so far.

Internal propertyInternal HeaderMagic

Public method Add(ReadOnlySpan<byte>, long)

Add a key-output pair. Keys MUST be added in strict lexicographic (byte) sorted order. Duplicate keys are not permitted.

Public method Finish()

Finalise the FST and return the serialised byte array. The builder cannot be used after calling Finish().

Internal methodInternal ReadVarInt(ReadOnlySpan<byte>, ref int)

Reads a VarInt-encoded long from buffer starting at offset. Advances offset past the consumed bytes.

Internal methodInternal VarIntSize(long)

Returns the number of bytes needed to VarInt-encode value.

Internal methodInternal WriteVarInt(byte[], int, long)

Writes a long value as a 7-bit variable-length integer. Returns the number of bytes written.