Public namespace Rowles.LeanLucene.Analysis.Tokenisers

Classes

Public class CJKBigramTokeniser

Tokeniser for CJK (Chinese, Japanese, Korean) text. Non-CJK text is split on whitespace. Runs of CJK characters are emitted as overlapping two-character tokens (bigrams), which is the standard approach for unsegmented CJK text.
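
The overlapping-bigram scheme can be illustrated with a minimal, language-neutral sketch in Python. It is not the library's implementation: the function names, the Unicode ranges treated as CJK, the handling of a lone CJK character (emitted as a single-character token), and the split at a CJK/non-CJK boundary are all assumptions made for illustration.

```python
from typing import List


def is_cjk(ch: str) -> bool:
    """Rough CJK test over a few common Unicode blocks (an assumption)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def cjk_bigram_tokens(text: str) -> List[str]:
    """Hypothetical helper: whitespace tokens for non-CJK, bigrams for CJK runs."""
    tokens: List[str] = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1  # whitespace only separates tokens
        elif is_cjk(ch):
            # Find the end of the current run of CJK characters.
            j = i
            while j < n and is_cjk(text[j]):
                j += 1
            run = text[i:j]
            if len(run) == 1:
                tokens.append(run)  # lone character kept as is (assumption)
            else:
                # Overlapping 2-character windows: "ABC" -> "AB", "BC".
                tokens.extend(run[k:k + 2] for k in range(len(run) - 1))
            i = j
        else:
            # Non-CJK token: read up to the next whitespace or CJK character.
            j = i
            while j < n and not text[j].isspace() and not is_cjk(text[j]):
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens
```

Under these assumptions, the input "hello 日本語 world" yields the tokens hello, 日本, 本語, world. Because adjacent bigrams share a character, a two-character query matches wherever that pair occurs in the text.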

Public class EdgeNGramTokeniser

Splits text into character substrings of length in [MinGram, MaxGram], each anchored at the start of a whitespace-delimited token (edge n-grams).

Thread-safety: This class maintains an instance-level intern cache (_internCache) for performance. Each instance should be used by a single thread; create a separate instance for each thread.
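
As an illustration of edge n-grams, the following Python sketch generates the prefixes of each whitespace-delimited token. It is a standalone example, not the library's code: the function name, the parameter names min_gram and max_gram (mirroring MinGram and MaxGram), the argument validation, and the choice to emit nothing for tokens shorter than min_gram are assumptions.

```python
from typing import List


def edge_ngrams(text: str, min_gram: int, max_gram: int) -> List[str]:
    """Hypothetical helper: prefixes of each whitespace-delimited token."""
    if min_gram < 1 or max_gram < min_gram:
        raise ValueError("expected 1 <= min_gram <= max_gram")
    grams: List[str] = []
    for word in text.split():
        # Never ask for a prefix longer than the word itself.
        longest = min(max_gram, len(word))
        for length in range(min_gram, longest + 1):
            grams.append(word[:length])  # always anchored at the word start
    return grams
```

With min_gram=2 and max_gram=4, the input "lean lucene" produces le, lea, lean, lu, luc, luce. Prefix-only grams of this kind are commonly used for search-as-you-type matching.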

Public class NGramTokeniser

Splits text into all contiguous character substrings of length in [MinGram, MaxGram]. Useful for partial-word matching and CJK text.

Thread-safety: This class maintains an instance-level intern cache (_internCache) for performance. Each instance should be used by a single thread; create a separate instance for each thread.
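
For comparison with edge n-grams, this Python sketch enumerates every contiguous substring whose length falls in the given range. It is illustrative only: the function name, the output order (by start position, then by length), and the treatment of the whole input as a single string are assumptions, and the library may differ on each point.

```python
from typing import List


def ngrams(text: str, min_gram: int, max_gram: int) -> List[str]:
    """Hypothetical helper: all substrings with min_gram <= length <= max_gram."""
    if min_gram < 1 or max_gram < min_gram:
        raise ValueError("expected 1 <= min_gram <= max_gram")
    grams: List[str] = []
    for start in range(len(text)):
        for length in range(min_gram, max_gram + 1):
            end = start + length
            if end > len(text):
                break  # longer grams from this start would run past the end
            grams.append(text[start:end])
    return grams
```

With min_gram=2 and max_gram=3, the input "lean" produces le, lea, ea, ean, an. With both bounds set to 2, the input "日本語" produces 日本 and 本語, which shows why n-grams suit text that has no word separators.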

Public class Tokeniser

Slices input text into tokens at word boundaries, splitting on whitespace and punctuation whilst tracking character offsets.
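
Offset tracking can be sketched in Python with a regular expression. This is a simplified stand-in rather than the library's algorithm: the Token type, the tokenise function, the use of \w+ to define a word (which also accepts digits and underscores), and the exclusive end offset are assumptions.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical token record carrying its position in the input."""
    text: str
    start: int  # offset of the first character
    end: int    # offset just past the last character (exclusive)


_WORD = re.compile(r"\w+")  # assumption: a word is a run of \w characters


def tokenise(text: str) -> List[Token]:
    """Split on whitespace and punctuation, recording character offsets."""
    return [Token(m.group(), m.start(), m.end()) for m in _WORD.finditer(text)]
```

For the input "Hello, world!" this returns Hello at offsets 0 to 5 and world at offsets 7 to 12. Keeping offsets lets a caller map each token back to the original text, for example to highlight matches.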

Interfaces

Public interface ITokeniser

Splits input text into raw tokens.