Rowles.LeanLucene.Analysis.Filters

Classes

AccentFoldingFilter: Normalises accented characters to their ASCII base form (e.g., é→e, ñ→n, ü→u) for language-neutral matching. Applies Unicode canonical decomposition (NFD), then strips the combining marks.
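
The decompose-then-strip technique can be sketched in a few lines of Python. This is an illustration of the general approach only, not the library's own code; the function name fold_accents is hypothetical. Note that in this sketch, letters with no canonical decomposition (such as ø) pass through unchanged.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Decompose to NFD, then drop every combining mark (categories Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.category(ch).startswith("M"))


print(fold_accents("café niño über"))  # cafe nino uber
```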

HtmlStripCharFilter: Strips HTML/XML tags from input text, leaving only the text content.
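
As a rough illustration of tag stripping, the following Python sketch collects only the text nodes of a document using the standard-library HTML parser. It is not the library's implementation; the names _TextCollector and strip_tags are hypothetical, and details such as how script or style content is treated are assumptions of the sketch.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Keeps text content and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup: str) -> str:
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()  # flush any buffered trailing text
    return "".join(collector.parts)


print(strip_tags("<p>Hello <b>world</b></p>"))  # Hello world
```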

LowercaseFilter: Performs an in-place lowercase transformation on tokens or a character buffer.

MappingCharFilter: Maps specific characters or strings to replacements using a lookup table. Useful for normalising special characters (e.g., smart quotes → straight quotes).
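
A lookup-table mapping of this kind might look like the following Python sketch. It is an illustration under assumptions, not the library's API: make_mapper and normalise_quotes are hypothetical names, and preferring the longest key when several match is a design choice of the sketch.

```python
import re


def make_mapper(mapping: dict):
    """Return a function that replaces every key of `mapping` with its value."""
    if not mapping:
        return lambda text: text
    # Longest keys first, so a multi-character key wins over its own prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)


normalise_quotes = make_mapper({
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2018": "'", "\u2019": "'",  # curly single quotes
})

print(normalise_quotes("\u201cit\u2019s\u201d"))  # "it's"
```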

PatternReplaceCharFilter: Replaces text matching a regex pattern with a replacement string.
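
For illustration, a regex-based replacement can be expressed in Python as below. The function name pattern_replace and the ISBN example are hypothetical and only show the idea of rewriting the raw text before tokenisation.

```python
import re


def pattern_replace(text: str, pattern: str, replacement: str) -> str:
    """Replace every match of `pattern` in `text` with `replacement`."""
    return re.sub(pattern, replacement, text)


# Remove hyphens that sit between two digits so the number stays one token.
print(pattern_replace("ISBN 978-0-13-110362-7", r"(?<=\d)-(?=\d)", ""))
# ISBN 9780131103627
```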

PorterStemmerFilter: Implements the Porter stemming algorithm for English (Porter, 1980) as an ITokenFilter. Operates on tokens in place, replacing each token's text with its stemmed form.
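
To give a flavour of how the algorithm rewrites suffixes, here is a Python sketch of Step 1a only, the first of its steps, which handles plural endings. It is deliberately incomplete: a full stemmer applies several further steps, so the outputs below are intermediate forms, and the function name porter_step_1a is hypothetical.

```python
def porter_step_1a(word: str) -> str:
    """Porter Step 1a: apply the rule for the longest matching suffix."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]   # IES  -> I
    if word.endswith("ss"):
        return word        # SS   -> SS (unchanged)
    if word.endswith("s"):
        return word[:-1]   # S    -> (removed)
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```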

StopWordFilter: Removes common English stop words from a token list using a frozen set for fast, allocation-free lookups.
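
The idea of filtering against an immutable set can be sketched in Python as follows. The word list here is a small hypothetical sample, not the library's actual stop word list, and the sketch assumes tokens have already been lowercased.

```python
# Hypothetical sample; a real stop word list is considerably longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})


def remove_stop_words(tokens):
    """Keep only tokens that are not in the immutable stop word set."""
    return [t for t in tokens if t not in STOP_WORDS]


print(remove_stop_words(["the", "quick", "fox", "is", "in", "a", "box"]))
# ['quick', 'fox', 'box']
```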

SynonymGraphFilter: Token filter that supports multi-token synonym expansion using a trie-based SynonymMap. Uses longest-match lookahead for multi-word synonyms and inserts replacement tokens at the same position offsets.

SynonymMap: Trie-based synonym map supporting multi-token source phrases. Used by SynonymGraphFilter for longest-match multi-token synonym expansion.
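
The trie lookup and longest-match expansion described for SynonymMap and SynonymGraphFilter can be sketched in Python as below. This is an illustration under assumptions rather than the library's API: build_trie, longest_match, and expand are hypothetical names, and the sketch assumes that original tokens are kept while synonym tokens are emitted starting at the position where the match began.

```python
_END = object()  # sentinel key marking the end of a source phrase


def build_trie(rules: dict) -> dict:
    """Build a token-level trie from {source phrase: replacement phrase}."""
    root = {}
    for source, replacement in rules.items():
        node = root
        for token in source.split():
            node = node.setdefault(token, {})
        node[_END] = replacement.split()
    return root


def longest_match(trie, tokens, start):
    """Return (match length, synonym tokens) for the longest phrase at `start`."""
    node, best = trie, None
    for i in range(start, len(tokens)):
        node = node.get(tokens[i])
        if node is None:
            break
        if _END in node:
            best = (i + 1 - start, node[_END])
    return best


def expand(trie, tokens):
    """Emit (position, token) pairs, adding synonyms at the match start."""
    out, i = [], 0
    while i < len(tokens):
        match = longest_match(trie, tokens, i)
        if match is None:
            out.append((i, tokens[i]))
            i += 1
            continue
        length, synonyms = match
        out.extend((i + k, tokens[i + k]) for k in range(length))
        out.extend((i + k, syn) for k, syn in enumerate(synonyms))
        i += length
    return out


trie = build_trie({"new york": "ny", "new york city": "nyc"})
print(expand(trie, "i love new york city pizza".split()))
```

In this example the three-token phrase "new york city" wins over the shorter "new york", so the synonym "nyc" is emitted at position 2 alongside the original tokens.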

Interfaces

ICharFilter: Interface for character-level filters that transform raw text. Char filters run before the tokeniser and operate on the entire input string.
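
The ordering described here, char filters first and tokenisation second, can be sketched in Python as a simple pipeline. All names below (analyse, strip_tags, tokenise) are hypothetical and do not reflect the interface's actual members; the regex-based tag stripping is a simplification for the example.

```python
import re
from typing import Callable, List

CharFilter = Callable[[str], str]  # whole input string in, transformed string out


def analyse(text: str, char_filters: List[CharFilter],
            tokenise: Callable[[str], List[str]]) -> List[str]:
    """Run every char filter over the full text, then tokenise the result."""
    for char_filter in char_filters:
        text = char_filter(text)
    return tokenise(text)


def strip_tags(text: str) -> str:
    return re.sub(r"<[^>]+>", " ", text)


def tokenise(text: str) -> List[str]:
    return re.findall(r"\w+", text)


print(analyse("<p>Hello, <b>World</b>!</p>", [strip_tags, str.lower], tokenise))
# ['hello', 'world']
```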