Rowles.LeanCorpus.Analysis.Tokenisers
Classes
CJKBigramTokeniser
Tokeniser for CJK (Chinese, Japanese, Korean) text using overlapping bigrams: runs of CJK characters produce overlapping 2-character tokens, the standard approach for unsegmented CJK text, whilst non-CJK text is tokenised on whitespace as usual.
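A minimal Python sketch of the overlapping-bigram scheme described above (illustrative only: the library itself is .NET, the character ranges below are a simplification of the full CJK blocks, and mixed CJK/Latin words are bigrammed wholesale here rather than segmented):

```python
import re

# Simplified CJK coverage: Hiragana/Katakana, CJK Unified Ideographs
# (plus Extension A), and Hangul syllables.
CJK = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")

def cjk_bigram_tokenise(text):
    """Emit overlapping 2-character tokens for CJK words, whole words otherwise."""
    tokens = []
    for word in text.split():
        if CJK.search(word):
            if len(word) == 1:
                tokens.append(word)  # a lone CJK character is emitted as-is
            else:
                tokens.extend(word[i:i + 2] for i in range(len(word) - 1))
        else:
            tokens.append(word)
    return tokens

# cjk_bigram_tokenise("東京タワー tower")
#   → ['東京', '京タ', 'タワ', 'ワー', 'tower']
```

Overlapping bigrams trade index size for recall: every adjacent character pair is indexed, so any 2+ character query substring can match without a word segmenter.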
EdgeNGramTokeniser
Splits text into character substrings of length in [MinGram, MaxGram], anchored at the start of each whitespace-delimited token (edge n-grams).
Thread-safety: This class maintains an instance-level intern cache (_internCache) for performance. Each instance should be used by a single thread, or callers should create separate instances per thread.
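A Python sketch of the edge n-gram behaviour (illustrative only; parameter names `min_gram`/`max_gram` mirror the MinGram/MaxGram settings above, and the intern cache is omitted):

```python
def edge_ngrams(text, min_gram=2, max_gram=4):
    """Emit prefixes of length min_gram..max_gram for each whitespace token."""
    tokens = []
    for word in text.split():
        # Clamp to the word length so short words still yield their prefixes.
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens

# edge_ngrams("quick brown") → ['qu', 'qui', 'quic', 'br', 'bro', 'brow']
```

Because every gram is anchored at the token start, edge n-grams support prefix (search-as-you-type) matching without the index blow-up of full n-grams.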
KeywordTokeniser
Treats the complete input as a single token.
LetterTokeniser
Splits input text into letter-only tokens, discarding digits and punctuation.
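The letter-only split can be sketched in Python as follows (illustrative only; the actual class may define "letter" via .NET's `char.IsLetter`, which this approximates with `str.isalpha`):

```python
def letter_tokenise(text):
    """Collect maximal runs of letters; digits and punctuation act as delimiters."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

# letter_tokenise("R2-D2 unit") → ['R', 'D', 'unit']
```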
NGramTokeniser
Splits text into all contiguous character substrings of length in [MinGram, MaxGram]. Useful for partial-word matching and CJK text.
Thread-safety: This class maintains an instance-level intern cache (_internCache) for performance. Each instance should be used by a single thread, or callers should create separate instances per thread.
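A Python sketch of full (non-edge) n-gram generation (illustrative only; the real class additionally interns the grams, which is omitted here):

```python
def ngrams(text, min_gram=2, max_gram=3):
    """Emit every contiguous substring of length min_gram..max_gram per token."""
    tokens = []
    for word in text.split():
        for n in range(min_gram, max_gram + 1):
            for i in range(len(word) - n + 1):
                tokens.append(word[i:i + n])
    return tokens

# ngrams("abcd") → ['ab', 'bc', 'cd', 'abc', 'bcd']
```

Unlike edge n-grams, these grams start at every position, so a query can match the middle of a word; the cost is a much larger token stream, which is why the intern cache exists.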
Tokeniser
Slices input text into tokens at word boundaries, splitting on whitespace and punctuation whilst tracking character offsets.
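A Python sketch of word-boundary slicing with offsets (illustrative only; the boundary rules are approximated here with a `\w+` regex, and the (token, start, end) shape is an assumed representation, not the library's actual token type):

```python
import re

WORD = re.compile(r"\w+")

def tokenise_with_offsets(text):
    """Return (token, start, end) triples; offsets index into the original text."""
    return [(m.group(), m.start(), m.end()) for m in WORD.finditer(text)]

# tokenise_with_offsets("Hello, world!") → [('Hello', 0, 5), ('world', 7, 12)]
```

Tracking offsets lets downstream consumers (e.g. highlighters) map each token back to its exact span in the source text.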
WhitespaceTokeniser
Splits input text into tokens separated only by whitespace.
Interfaces
ITokeniser
Splits input text into raw tokens.