
CJKBigramTokeniser
- Namespace
- Rowles.LeanLucene.Analysis.Tokenisers
- Assembly
- Rowles.LeanLucene.dll
Tokeniser for CJK (Chinese, Japanese, Korean) text using overlapping bigrams. Non-CJK text is tokenised on whitespace as usual. Runs of CJK characters produce overlapping two-character tokens (for example, a three-character run ABC yields AB and BC), which is the standard approach for unsegmented CJK text.
public sealed class CJKBigramTokeniser : ITokeniser
- Implements
- ITokeniser
Methods
Tokenise(ReadOnlySpan<char>)
Splits the input text into a list of tokens at word boundaries.
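
The overlapping-bigram scheme described in the class summary can be illustrated with a minimal, standalone sketch. This is not the library's implementation: `BigramSketch`, its `Tokenise` method, the `IsCjk` helper, and the `List<string>` return type are all hypothetical names chosen for illustration. The Unicode ranges are a simplified subset, and two behaviours are assumptions rather than documented facts: a single isolated CJK character is emitted as a one-character token, and a non-CJK word is also cut where a CJK character begins.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of Rowles.LeanLucene.
public static class BigramSketch
{
    // Simplified CJK check (assumption): the real tokeniser may cover
    // more, fewer, or different Unicode ranges.
    static bool IsCjk(char c) =>
        (c >= '\u4E00' && c <= '\u9FFF') ||   // CJK Unified Ideographs
        (c >= '\u3040' && c <= '\u30FF') ||   // Hiragana and Katakana
        (c >= '\uAC00' && c <= '\uD7AF');     // Hangul Syllables block

    public static List<string> Tokenise(ReadOnlySpan<char> text)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            // Whitespace separates tokens and is never emitted.
            if (char.IsWhiteSpace(text[i])) { i++; continue; }

            int start = i;
            if (IsCjk(text[i]))
            {
                // Consume the whole run of consecutive CJK characters.
                while (i < text.Length && IsCjk(text[i])) i++;

                if (i - start == 1)
                {
                    // Assumption: a lone CJK character becomes a unigram.
                    tokens.Add(text.Slice(start, 1).ToString());
                }
                else
                {
                    // Emit every adjacent pair: ABC -> AB, BC.
                    for (int j = start; j + 1 < i; j++)
                        tokens.Add(text.Slice(j, 2).ToString());
                }
            }
            else
            {
                // Non-CJK word: read up to the next whitespace or CJK character.
                while (i < text.Length && !char.IsWhiteSpace(text[i]) && !IsCjk(text[i])) i++;
                tokens.Add(text.Slice(start, i - start).ToString());
            }
        }
        return tokens;
    }
}
```

Under these assumptions, `BigramSketch.Tokenise("hello 東京都 world")` yields the tokens `hello`, `東京`, `京都`, `world`. Because every adjacent pair is indexed, a query for either two-character substring of the CJK run can match without a dictionary-based segmenter.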