Class CJKBigramTokeniser

Namespace
Rowles.LeanLucene.Analysis.Tokenisers
Assembly
Rowles.LeanLucene.dll

Tokeniser for CJK (Chinese, Japanese, Korean) text using overlapping bigrams. Non-CJK text is split on whitespace as usual. Runs of CJK characters produce overlapping two-character tokens, which is the standard approach for unsegmented CJK text.
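To make the overlapping-bigram idea concrete, here is a minimal standalone sketch. It is not the library's implementation: `BigramSketch`, `Bigrams`, and `IsCjk` are hypothetical names, the CJK ranges checked are a simplified assumption, and the handling of a lone CJK character (emitted as a single-character token) and of script changes without whitespace (treated as a token break) are guesses about reasonable behaviour rather than documented behaviour of `CJKBigramTokeniser`.

```csharp
using System;
using System.Collections.Generic;

public static class BigramSketch
{
    // Hypothetical helper (assumption): only Hiragana/Katakana,
    // CJK Unified Ideographs and Hangul Syllables count as CJK here.
    private static bool IsCjk(char c) =>
        (c >= '\u3040' && c <= '\u30FF') ||
        (c >= '\u4E00' && c <= '\u9FFF') ||
        (c >= '\uAC00' && c <= '\uD7AF');

    public static List<string> Bigrams(string text)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            // Whitespace separates tokens and is never emitted.
            if (char.IsWhiteSpace(text[i])) { i++; continue; }

            // Collect a run of characters that are all CJK or all non-CJK.
            int start = i;
            bool cjk = IsCjk(text[i]);
            while (i < text.Length
                   && !char.IsWhiteSpace(text[i])
                   && IsCjk(text[i]) == cjk)
            {
                i++;
            }

            int length = i - start;
            if (!cjk || length == 1)
            {
                // Non-CJK run, or a lone CJK character: one token.
                tokens.Add(text.Substring(start, length));
            }
            else
            {
                // CJK run: slide a 2-character window one position at a time.
                for (int j = start; j + 1 < i; j++)
                {
                    tokens.Add(text.Substring(j, 2));
                }
            }
        }
        return tokens;
    }
}
```

Under these assumptions, `BigramSketch.Bigrams("hello 日本語")` yields `hello`, `日本`, `本語`: the Latin word stays whole, while the three-character CJK run becomes two overlapping pairs.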

public sealed class CJKBigramTokeniser : ITokeniser
Inheritance
object
CJKBigramTokeniser
Implements
ITokeniser

Methods

Tokenise(ReadOnlySpan<char>)

Splits the input text into a list of tokens at word boundaries.