CJKBigramTokeniser

Namespace: Rowles.LeanLucene.Analysis.Tokenisers

Assembly: Rowles.LeanLucene.dll

Tokeniser for CJK (Chinese, Japanese, Korean) text that emits overlapping bigrams. Runs of CJK characters produce overlapping two-character tokens, the standard approach for CJK text that has no explicit word boundaries. Non-CJK text is tokenised on whitespace.
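The overlapping-bigram idea described above can be illustrated with a small standalone sketch. This is not the library's implementation: the `BigramSketch` class, its `Split` method, and the `IsCjk` range check are hypothetical names introduced here. The sketch assumes that a non-CJK run ends at whitespace or at the first CJK character, and that a lone CJK character is emitted as a single-character token. The real tokeniser may handle these cases differently.

```csharp
using System.Collections.Generic;

// Hypothetical illustration only; not part of Rowles.LeanLucene.
public static class BigramSketch
{
    // Assumption: only these common blocks are treated as CJK in this sketch.
    private static bool IsCjk(char c) =>
        (c >= '\u4E00' && c <= '\u9FFF') ||   // CJK Unified Ideographs
        (c >= '\u3040' && c <= '\u30FF') ||   // Hiragana and Katakana
        (c >= '\uAC00' && c <= '\uD7AF');     // Hangul Syllables

    public static List<string> Split(string text)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            // Whitespace separates tokens and is never emitted.
            if (char.IsWhiteSpace(text[i])) { i++; continue; }

            int start = i;
            if (IsCjk(text[i]))
            {
                // Consume the whole run of CJK characters.
                while (i < text.Length && IsCjk(text[i])) i++;

                if (i - start == 1)
                {
                    // Assumption: a lone CJK character becomes a unigram.
                    tokens.Add(text.Substring(start, 1));
                }
                else
                {
                    // Emit every adjacent pair, so neighbouring tokens share one character.
                    for (int j = start; j + 1 < i; j++)
                        tokens.Add(text.Substring(j, 2));
                }
            }
            else
            {
                // Non-CJK run: read up to the next whitespace or CJK character.
                while (i < text.Length && !char.IsWhiteSpace(text[i]) && !IsCjk(text[i])) i++;
                tokens.Add(text.Substring(start, i - start));
            }
        }
        return tokens;
    }
}
```

Under these assumptions, `BigramSketch.Split("hello 東京都 world")` yields `hello`, `東京`, `京都`, `world`. Because adjacent bigrams overlap, a query for either 東京 or 京都 can match the indexed text without a dictionary-based word segmenter.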

public sealed class CJKBigramTokeniser : ITokeniser

CJKBigramTokeniser

Implements: ITokeniser

Methods

Tokenise(ReadOnlySpan<char>): Splits the input text into a list of tokens at word boundaries.
