Stop words and the token budget
Stop words
StopWordFilter drops tokens that match a supplied set. The library ships
StopWords.English and equivalents for other supported languages.
var filter = new StopWordFilter(StopWords.English);
Stop word removal happens at analysis time. Both index-time and query-time analysers must apply the same set, otherwise removed query terms produce zero hits.
The token budget
Per-document token count is capped via IndexWriterConfig.MaxTokensPerDocument
(default 0, unlimited). When the cap is hit, behaviour is governed by
TokenBudgetPolicy:
Truncate(default) — silently stop processing additional tokens.Reject— throw, refusing the document.
var config = new IndexWriterConfig
{
MaxTokensPerDocument = 100_000,
TokenBudgetPolicy = TokenBudgetPolicy.Truncate,
};
A finite budget is useful when ingesting unknown user content where pathological documents would otherwise dominate buffer memory.