Rowles.LeanCorpus.Analysis.Tokenisers
Classes
CJKBigramTokeniser
Tokeniser for CJK (Chinese, Japanese, Korean) text using overlapping bigrams: runs of CJK characters produce overlapping 2-character tokens, the standard approach for unsegmented CJK text, whilst non-CJK text is tokenised on whitespace as usual.
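A minimal Python sketch of the overlapping-bigram scheme described above (illustrative only: the library itself is .NET, the character ranges below are a simplification of the full CJK blocks, and mixed CJK/Latin words are bigrammed wholesale here rather than segmented):

```python
import re

# Simplified CJK coverage: Hiragana/Katakana, CJK Unified Ideographs
# (plus Extension A), and Hangul syllables.
CJK = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")

def cjk_bigram_tokenise(text):
    """Emit overlapping 2-character tokens for CJK words, whole words otherwise."""
    tokens = []
    for word in text.split():
        if CJK.search(word):
            if len(word) == 1:
                tokens.append(word)  # a lone CJK character is emitted as-is
            else:
                tokens.extend(word[i:i + 2] for i in range(len(word) - 1))
        else:
            tokens.append(word)
    return tokens

# cjk_bigram_tokenise("東京タワー tower")
#   → ['東京', '京タ', 'タワ', 'ワー', 'tower']
```

Overlapping bigrams trade index size for recall: every adjacent character pair is indexed, so any 2+ character query substring can match without a word segmenter.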
EdgeNGramTokeniser
Splits text into character substrings of length in [MinGram, MaxGram], anchored at the start of each whitespace-delimited token (edge n-grams).
Thread-safety: This class maintains an instance-level intern cache (_internCache) for performance. Each instance should be used by a single thread, or callers should create separate instances per thread.
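A Python sketch of the edge n-gram behaviour (illustrative only; parameter names `min_gram`/`max_gram` mirror the MinGram/MaxGram settings above, and the intern cache is omitted):

```python
def edge_ngrams(text, min_gram=2, max_gram=4):
    """Emit prefixes of length min_gram..max_gram for each whitespace token."""
    tokens = []
    for word in text.split():
        # Clamp to the word length so short words still yield their prefixes.
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens

# edge_ngrams("quick brown") → ['qu', 'qui', 'quic', 'br', 'bro', 'brow']
```

Because every gram is anchored at the token start, edge n-grams support prefix (search-as-you-type) matching without the index blow-up of full n-grams.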
KeywordTokeniser
Treats the complete input as a single token.
LetterTokeniser
Splits input text into letter-only tokens, discarding digits and punctuation.
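The letter-only split can be sketched in Python as follows (illustrative only; the actual class may define "letter" via .NET's `char.IsLetter`, which this approximates with `str.isalpha`):

```python
def letter_tokenise(text):
    """Collect maximal runs of letters; digits and punctuation act as delimiters."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

# letter_tokenise("R2-D2 unit") → ['R', 'D', 'unit']
```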
NGramTokeniser
Splits text into all contiguous character substrings of length in [MinGram, MaxGram]. Useful for partial-word matching and CJK text.
Thread-safety: This class maintains an instance-level intern cache (_internCache) for performance. Each instance should be used by a single thread, or callers should create separate instances per thread.
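A Python sketch of full (non-edge) n-gram generation (illustrative only; the real class additionally interns the grams, which is omitted here):

```python
def ngrams(text, min_gram=2, max_gram=3):
    """Emit every contiguous substring of length min_gram..max_gram per token."""
    tokens = []
    for word in text.split():
        for n in range(min_gram, max_gram + 1):
            for i in range(len(word) - n + 1):
                tokens.append(word[i:i + n])
    return tokens

# ngrams("abcd") → ['ab', 'bc', 'cd', 'abc', 'bcd']
```

Unlike edge n-grams, these grams start at every position, so a query can match the middle of a word; the cost is a much larger token stream, which is why the intern cache exists.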
Tokeniser
Slices input text into tokens at word boundaries, splitting on whitespace and punctuation whilst tracking character offsets.
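A Python sketch of word-boundary slicing with offsets (illustrative only; the boundary rules are approximated here with a `\w+` regex, and the (token, start, end) shape is an assumed representation, not the library's actual token type):

```python
import re

WORD = re.compile(r"\w+")

def tokenise_with_offsets(text):
    """Return (token, start, end) triples; offsets index into the original text."""
    return [(m.group(), m.start(), m.end()) for m in WORD.finditer(text)]

# tokenise_with_offsets("Hello, world!") → [('Hello', 0, 5), ('world', 7, 12)]
```

Tracking offsets lets downstream consumers (e.g. highlighters) map each token back to its exact span in the source text.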
WhitespaceTokeniser
Splits input text into tokens separated only by whitespace.
Interfaces
ITokeniser
Splits input text into raw tokens.