Rowles.LeanLucene.Analysis.Tokenisers
Classes
CJKBigramTokeniser
Tokeniser for CJK (Chinese, Japanese, Korean) text using overlapping bigrams. Non-CJK text is tokenised on whitespace as usual; runs of CJK characters produce overlapping 2-character tokens, the standard approach for unsegmented CJK text.
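The bigram scheme can be sketched as follows. This is an illustrative, language-neutral sketch of the algorithm, not the class's actual implementation or API; the Unicode ranges in the regex are a simplified approximation of CJK detection.

```python
import re

# Simplified CJK detection: CJK Unified Ideographs, Hiragana/Katakana,
# Hangul syllables (illustrative; the real class may cover more ranges).
CJK = re.compile(r'[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]+')

def bigram_tokenise(text):
    tokens = []
    pos = 0
    for m in CJK.finditer(text):
        # Non-CJK stretches are split on whitespace as usual.
        tokens += text[pos:m.start()].split()
        run = m.group()
        if len(run) == 1:
            tokens.append(run)  # a lone CJK character is kept as-is
        else:
            # Overlapping 2-character tokens: "东京都" -> "东京", "京都"
            tokens += [run[i:i + 2] for i in range(len(run) - 1)]
        pos = m.end()
    tokens += text[pos:].split()
    return tokens
```

Because adjacent bigrams overlap by one character, a query matches regardless of where the actual (unmarked) word boundaries fall, at the cost of some false positives.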
EdgeNGramTokeniser
Splits text into character substrings of length in [MinGram, MaxGram] anchored at the start of each whitespace-delimited token (edge n-grams).
Thread-safety: This class maintains an instance-level intern cache (_internCache) for performance. Each instance should be used by a single thread, or callers should create separate instances per thread.
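The edge n-gram behaviour described above can be sketched as below. This is an illustrative sketch only; the parameter names mirror the documented MinGram/MaxGram properties but the function is not the class's API.

```python
def edge_ngrams(text, min_gram=2, max_gram=4):
    """Emit prefixes of length min_gram..max_gram for each
    whitespace-delimited token (edge n-grams)."""
    grams = []
    for word in text.split():
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            grams.append(word[:n])  # anchored at the start of the token
    return grams
```

Edge n-grams are the usual building block for prefix and autocomplete matching: indexing "search" as "se", "sea", "sear" lets a partial query hit the full term.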
NGramTokeniser
Splits text into all contiguous character substrings of length in [MinGram, MaxGram]. Useful for partial-word matching and CJK text.
Thread-safety: This class maintains an instance-level intern cache (_internCache) for performance. Each instance should be used by a single thread, or callers should create separate instances per thread.
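The difference from the edge variant is that every window position is emitted, not just prefixes. A minimal sketch of the described behaviour, again using the documented MinGram/MaxGram names as hypothetical parameters:

```python
def ngrams(text, min_gram=2, max_gram=3):
    """Emit all contiguous substrings of length min_gram..max_gram
    within each whitespace-delimited token."""
    grams = []
    for word in text.split():
        for n in range(min_gram, max_gram + 1):
            for i in range(len(word) - n + 1):  # slide the window
                grams.append(word[i:i + n])
    return grams
```

Emitting interior substrings is what makes n-grams useful for partial-word matching (a query for "bcd" finds "abcd") and for CJK text, where word boundaries are not marked.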
Tokeniser
Slices input text into tokens at word boundaries, splitting on whitespace and punctuation whilst tracking character offsets.
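The word-boundary splitting with offset tracking can be approximated as below. This is a sketch, not the class's implementation; `\w+` (letters, digits, underscore) is a rough stand-in for "everything between whitespace and punctuation".

```python
import re

def tokenise(text):
    """Yield (token, start, end) triples, where start/end are the
    token's character offsets in the original text."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r'\w+', text)]
```

Keeping character offsets alongside each token is what enables later features such as hit highlighting to map tokens back to positions in the source text.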
Interfaces
ITokeniser
Splits input text into raw tokens.