stemming
Status: ✅ Implemented (v1.1) – Per-Index Configuration
Themis supports optional stemming for fulltext indexes to improve text matching by reducing words to their root form. This increases recall by matching different word forms (e.g., "running" matches "run", "runs").
| Language | Code | Algorithm | Examples |
|---|---|---|---|
| English | en | Porter Subset | running→run, cats→cat, played→play |
| German | de | Suffix Removal | laufen→lauf, machte→macht, gruppen→grupp |
| None | none | No stemming | Exact token matching only |
HTTP API:
POST /index/create
{
"table": "articles",
"column": "content",
"type": "fulltext",
"config": {
"stemming_enabled": true,
"language": "de",
"stopwords_enabled": true // optional: remove stopwords
}
}
C++ API:
SecondaryIndexManager::FulltextConfig config;
config.stemming_enabled = true;
config.language = "en";
auto status = indexMgr.createFulltextIndex("articles", "content", config);
Without a config, stemming is disabled by default:
POST /index/create
{
"table": "articles",
"column": "content",
"type": "fulltext"
}
# Equivalent to:
# "config": {"stemming_enabled": false, "language": "none"}
When a document is indexed with stemming enabled:
- Tokenization: Text is split on whitespace and punctuation
- Lowercase: Tokens are converted to lowercase
- Stemming: Tokens are reduced to their stem form (if enabled)
- Storage: Stemmed tokens are stored in the inverted index
Note: If stopwords are enabled, stopwords are filtered out before stemming. If umlaut normalization is enabled (German), normalization occurs before tokenization.
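The steps above can be sketched as a small analysis pipeline. This is a minimal illustration, not Themis's tokenizer: the names `tokenize` and `analyze`, the ASCII-only split rule, and passing the stemmer as a callback are all assumptions made for the example.

```cpp
#include <cctype>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: split on non-alphanumeric ASCII bytes and lowercase.
// Bytes >= 0x80 (e.g. UTF-8 umlauts) are kept untouched so such words stay whole.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c) || c >= 0x80) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Hypothetical pipeline: tokenize + lowercase, drop stopwords, then stem.
// An empty stopword set means stopwords are disabled; a null `stem` means
// stemming is disabled and tokens are stored as-is.
std::vector<std::string> analyze(
    const std::string& text,
    const std::unordered_set<std::string>& stopwords,
    const std::function<std::string(const std::string&)>& stem) {
    std::vector<std::string> out;
    for (const auto& tok : tokenize(text)) {
        if (stopwords.count(tok) > 0) continue;  // filtered before stemming
        out.push_back(stem ? stem(tok) : tok);
    }
    return out;
}
```

Running the same `analyze` function over both documents and queries is what keeps index terms and query terms comparable.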
Example (English):
Input: "Machine learning algorithms are optimizing systems"
Tokens: ["machine", "learning", "algorithms", "are", "optimizing", "systems"]
Stems: ["machin", "learn", "algorithm", "are", "optim", "system"]
Example (German):
Input: "Die Maschinen lernen aus vergangenen Fehlern"
Tokens: ["die", "maschinen", "lernen", "aus", "vergangenen", "fehlern"]
Stems: ["die", "maschin", "lern", "aus", "vergangen", "fehl"]
When searching with stemming enabled:
- Query tokens are processed identically to index tokens
- Stemmed query terms match stemmed index terms
- BM25 scoring uses stemmed token statistics
Example Query:
POST /search/fulltext
{
"table": "articles",
"column": "content",
"query": "learning optimization",
"limit": 10
}
With stemming enabled (language: "en"):
- Query stems to: ["learn", "optim"]
- Matches documents containing: "learning", "learned", "learns", "optimize", "optimizing", "optimization"
Implements a simplified version of the Porter Stemmer:
Step 1a - Plurals:
- `sses` → `ss` (caresses → caress)
- `ies` → `i` (ponies → poni)
- `s` → removed (cats → cat)
Step 1b - Past Tense:
- `eed` → `ee` (agreed → agree)
- `ed` → removed (played → play)
- `ing` → removed (running → run, with double consonant removal)
Step 1c - Y suffix:
- `y` → `i` (happy → happi, only if preceded by consonant)
Step 2 - Common Suffixes:
- `ational` → `ate` (relational → relate)
- `ation` → `ate` (activation → activate)
- `ness` → removed (goodness → good)
- `enci` → `enc` (valenci → valenc)
Limitations:
- Simplified subset (not full Porter)
- No Step 3-5 transformations
- Minimum word length: 3 characters
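The rules listed above can be sketched in C++ as follows. This is an illustrative sketch only, not Themis's `Stemmer` implementation. The function name `stem_en`, the requirement that the remaining stem contains a vowel before `ed`/`ing` is stripped, and the l/s/z exception for double consonant removal are assumptions (the latter two borrowed from the original Porter algorithm).

```cpp
#include <cstddef>
#include <string>

// True when `w` ends with `suf`.
bool ends_with(const std::string& w, const std::string& suf) {
    return w.size() >= suf.size() &&
           w.compare(w.size() - suf.size(), suf.size(), suf) == 0;
}

bool is_vowel(char c) {
    return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u';
}

bool has_vowel(const std::string& s) {
    for (char c : s) {
        if (is_vowel(c)) return true;
    }
    return false;
}

// Replace a trailing `suf` with `rep`; returns true when the rule fired.
bool replace_suffix(std::string& w, const std::string& suf, const std::string& rep) {
    if (!ends_with(w, suf)) return false;
    w.replace(w.size() - suf.size(), suf.size(), rep);
    return true;
}

// Drop a doubled final consonant ("runn" -> "run"), except l, s, z (assumption).
void undouble(std::string& w) {
    const std::size_t n = w.size();
    if (n >= 2 && w[n - 1] == w[n - 2] && !is_vowel(w[n - 1]) &&
        w[n - 1] != 'l' && w[n - 1] != 's' && w[n - 1] != 'z') {
        w.pop_back();
    }
}

// Expects an already lowercased token.
std::string stem_en(std::string w) {
    if (w.size() < 3) return w;  // minimum word length: 3 characters

    // Step 1a: plurals
    if (!replace_suffix(w, "sses", "ss") && !replace_suffix(w, "ies", "i") &&
        ends_with(w, "s") && !ends_with(w, "ss")) {
        w.pop_back();
    }

    // Step 1b: past tense and gerund
    if (!replace_suffix(w, "eed", "ee")) {
        const std::string suffixes[] = {"ed", "ing"};
        for (const std::string& suf : suffixes) {
            if (ends_with(w, suf) && has_vowel(w.substr(0, w.size() - suf.size()))) {
                w.erase(w.size() - suf.size());
                undouble(w);
                break;
            }
        }
    }

    // Step 1c: y -> i when preceded by a consonant
    if (w.size() > 2 && w.back() == 'y' && !is_vowel(w[w.size() - 2])) {
        w.back() = 'i';
    }

    // Step 2: common suffixes (at most one rule fires)
    if (!replace_suffix(w, "ational", "ate") && !replace_suffix(w, "ation", "ate") &&
        !replace_suffix(w, "ness", "")) {
        replace_suffix(w, "enci", "enc");
    }
    return w;
}
```

Under these assumptions, `stem_en("running")` yields `run` and `stem_en("relational")` yields `relate`, in line with the examples in the rule list.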
Removes common German suffixes in order:
Plurals and Declension:
- `ern`, `em`, `en`, `er`, `es`, `e`, `s`
Derivational Suffixes:
- `ung` (Handlung → Handl)
- `heit` (Freiheit → Frei)
- `keit` (Möglichkeit → Möglich)
- `lich` (freundlich → freund)
Limitations:
- No umlaut normalization (ä, ö, ü unchanged)
- No compound word splitting
- No strong verb handling (irregular forms)
- Order-dependent (may over-stem in edge cases)
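An ordered suffix-removal pass like the one described above might look like the following sketch. It is an assumption-laden illustration, not the Themis code. `stem_de` and `strip_first` are hypothetical names. The sketch also assumes that at most one inflectional and one derivational suffix are removed, and that at least three bytes of stem must remain (which keeps short words such as "die" intact).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True when `w` ends with `suf`.
bool has_suffix(const std::string& w, const std::string& suf) {
    return w.size() >= suf.size() &&
           w.compare(w.size() - suf.size(), suf.size(), suf) == 0;
}

// Remove the first matching suffix (list order matters), provided at least
// `min_stem` bytes remain. Returns true if a suffix was removed.
bool strip_first(std::string& w, const std::vector<std::string>& suffixes,
                 std::size_t min_stem) {
    for (const auto& suf : suffixes) {
        if (has_suffix(w, suf) && w.size() - suf.size() >= min_stem) {
            w.erase(w.size() - suf.size());
            return true;
        }
    }
    return false;
}

// Expects an already lowercased token; umlauts are left unchanged.
std::string stem_de(std::string w) {
    static const std::vector<std::string> inflection =
        {"ern", "em", "en", "er", "es", "e", "s"};
    static const std::vector<std::string> derivation =
        {"ung", "heit", "keit", "lich"};
    const std::size_t min_stem = 3;  // assumption, not taken from Themis

    strip_first(w, inflection, min_stem);  // plurals and declension first
    strip_first(w, derivation, min_stem);  // then derivational suffixes
    return w;
}
```

Because the lists are scanned in order, `fehlern` loses `ern` rather than just `n`, which is also why the approach can over-stem in edge cases.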
Stemming configuration is persisted in RocksDB:
Key: ftidxmeta:table:column
Value (JSON):
{
"type": "fulltext",
"stemming_enabled": true,
"language": "de"
}
Stemmed tokens are stored in the same index keys as non-stemmed:
- Presence: `ftidx:table:column:token:PK` → "" (token is stemmed if config enabled)
- Term Frequency: `fttf:table:column:token:PK` → count
- Doc Length: `ftdlen:table:column:PK` → total_tokens
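To make the key layout concrete, the helpers below build the four key shapes listed above. The function names are hypothetical, and the sketch assumes that table, column, token, and primary key values contain no `:` characters. How Themis escapes separators is not described here.

```cpp
#include <string>

// Key holding the JSON index configuration.
std::string meta_key(const std::string& table, const std::string& column) {
    return "ftidxmeta:" + table + ":" + column;
}

// Presence key; `token` is the stemmed form when stemming is enabled.
std::string presence_key(const std::string& table, const std::string& column,
                         const std::string& token, const std::string& pk) {
    return "ftidx:" + table + ":" + column + ":" + token + ":" + pk;
}

// Term-frequency key; the value would be the per-document count.
std::string term_freq_key(const std::string& table, const std::string& column,
                          const std::string& token, const std::string& pk) {
    return "fttf:" + table + ":" + column + ":" + token + ":" + pk;
}

// Document-length key; the value would be the total token count.
std::string doc_length_key(const std::string& table, const std::string& column,
                           const std::string& pk) {
    return "ftdlen:" + table + ":" + column + ":" + pk;
}
```

Since the token is part of the key, "running" and "runs" in one document collapse into a single `run` entry once stemming is on.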
Indexes created before stemming support:
- Behavior: Config lookup returns `{stemming_enabled: false, language: "none"}`
- Migration: Recreate index with `POST /index/create` and new config
- No Auto-Migration: Existing indexes remain unchanged
- `POST /index/create` without `config` field → no stemming (default)
- Query API unchanged: `/search/fulltext` automatically uses index config
- C++ API: `createFulltextIndex(table, column)` → default config
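The fallback behaviour for older indexes can be pictured as a lookup that substitutes defaults when no metadata entry exists. The struct and function below are hypothetical stand-ins, not the real `FulltextConfig` type, and assume C++17 for `std::optional`.

```cpp
#include <optional>
#include <string>

// Hypothetical mirror of the persisted config; defaults mean "no stemming".
struct FulltextConfigSketch {
    bool stemming_enabled = false;
    std::string language = "none";
};

// `stored` is the parsed metadata value, or std::nullopt when the index
// predates stemming support and has no metadata entry.
FulltextConfigSketch effective_config(
    const std::optional<FulltextConfigSketch>& stored) {
    return stored.value_or(FulltextConfigSketch{});
}
```

An index without metadata therefore behaves exactly like one created with stemming explicitly disabled.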
- Index size: Stemming typically reduces unique token count by 10-30%
- Compression: Fewer unique tokens → better RocksDB compression
- Trade-off: Slight increase in false positives (over-matching)
- Query-time impact: Negligible (stemming overhead < 1% of total query time)
- Optimization: Stemmer uses in-memory string manipulation
- Caching: Not needed (stemming is fast enough)
- Index build impact: +5-10% for large datasets (stemming overhead)
- Mitigation: Rebuild only needed when changing config
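To check the unique-token reduction on your own corpus, a rough measurement such as the one below may help. `unique_token_reduction` is a hypothetical helper, and the stemmer is passed in as a callback so any implementation can be plugged in.

```cpp
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Fraction of distinct terms eliminated by stemming:
// 1 - (distinct stems / distinct raw tokens). Returns 0 for empty input.
double unique_token_reduction(
    const std::vector<std::string>& tokens,
    const std::function<std::string(const std::string&)>& stem) {
    const std::unordered_set<std::string> raw(tokens.begin(), tokens.end());
    if (raw.empty()) return 0.0;

    std::unordered_set<std::string> stemmed;
    for (const auto& t : raw) stemmed.insert(stem(t));

    return 1.0 - static_cast<double>(stemmed.size()) /
                 static_cast<double>(raw.size());
}
```

A result of 0.2 would mean the stemmed index holds 20% fewer distinct terms than the unstemmed one for that sample.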
See tests/test_stemming.cpp:
// English stemming
EXPECT_EQ(Stemmer::stem("cats", EN), "cat");
EXPECT_EQ(Stemmer::stem("running", EN), "run");
EXPECT_EQ(Stemmer::stem("relational", EN), "relate");
// German stemming
EXPECT_EQ(Stemmer::stem("laufen", DE), "lauf");
EXPECT_EQ(Stemmer::stem("machte", DE), "macht");
EXPECT_EQ(Stemmer::stem("wirkung", DE), "wirk");
// Create index with stemming
FulltextConfig config{true, "en"};
indexMgr->createFulltextIndex("articles", "content", config);
// Insert document
BaseEntity doc("doc1");
doc.setField("content", "running dogs");
indexMgr->put("articles", doc);
// Query with base form
auto [status, results] = indexMgr->scanFulltext("articles", "content", "run");
EXPECT_EQ(results.size(), 1); // Matches "running"
✅ Enable stemming when:
- Content is in a supported language (EN/DE)
- Recall is more important than precision
- Users search with different word forms
- Text contains morphological variations (verbs, plurals)
❌ Disable stemming when:
- Exact matching is required (e.g., product codes, technical terms)
- Content is multilingual without dominant language
- Domain-specific terminology should not be normalized
- Precision is critical (avoid false positives)
- Monolingual content: Use appropriate language code (`en`, `de`)
- Mixed content: Choose dominant language or use `none`
- Unknown language: Use `none` (exact matching)
To change stemming config:
- Drop existing index: `POST /index/drop`
- Create new index with config: `POST /index/create`
- Data will be automatically re-indexed on next entity update
- Optional: Trigger rebuild via `POST /index/rebuild`
- Umlaut normalization: ä→a, ö→o, ü→u for German (implemented in v1.3)
- More languages: FR, ES, IT, NL via Snowball integration
- Custom stemmers: Plugin interface for domain-specific rules
- Compound word splitting (German): "Fußballweltmeisterschaft" → ["fußball", "welt", "meisterschaft"]
- Lemmatization: More accurate than stemming ("better" → "good")
- N-grams: Partial matching and typo tolerance
- Phonetic matching: Soundex/Metaphone for fuzzy search
# Create index with German stemming
POST /index/create
{
"table": "gesetze",
"column": "text",
"type": "fulltext",
"config": {"stemming_enabled": true, "language": "de"}
}
# Insert document
PUT /entities/gesetze/bgb123
{"text": "Die Verträge müssen schriftlich geschlossen werden"}
# Search ("Vertrag" and "Vertrags" stem to "vertrag"; "Verträge" matches them only if umlaut normalization is enabled)
POST /search/fulltext
{
"table": "gesetze",
"column": "text",
"query": "Vertrag schriftlich",
"limit": 20
}
# Create index with English stemming
POST /index/create
{
"table": "docs",
"column": "content",
"type": "fulltext",
"config": {"stemming_enabled": true, "language": "en"}
}
# Insert documents
PUT /entities/docs/ml101
{"content": "Machine learning algorithms optimize neural networks"}
PUT /entities/docs/ml102
{"content": "Optimizing machine learned models for production"}
# Search (matches both documents)
POST /search/fulltext
{
"table": "docs",
"column": "content",
"query": "optimize learning",
"limit": 10
}
# Response:
# [
# {"pk": "ml102", "score": 9.42}, # "Optimizing...learned"
# {"pk": "ml101", "score": 8.15} # "learning...optimize"
# ]
Problem: Query returns empty results after enabling stemming
Diagnosis:
- Check if index was recreated with new config
- Verify documents were re-indexed after config change
- Test with non-stemmed query (exact token match)
Solution:
# Rebuild index to apply stemming to existing documents
POST /index/rebuild
{"table": "docs", "column": "content"}
Problem: Query matches unrelated documents
Cause: Over-stemming (common with aggressive algorithms)
Example:
- "university" → "univers"
- "universal" → "univers"
- Both match despite different meanings
Solution:
- Disable stemming if precision is critical
- Use exact phrases with quotes (future feature)
- Add domain-specific stopwords
Problem: Poor results for multilingual content
Cause: Single-language stemmer applied to mixed content
Solution:
- Create separate indexes per language
- Use language detection to route queries
- Fallback to `language: "none"` for mixed content
- Porter Stemmer: Martin Porter, 1980
- Snowball Algorithms: tartarus.org/martin/PorterStemmer/
- BM25 Ranking: See `docs/search/fulltext_api.md`
- HTTP API: See `openapi/openapi.yaml`
Last Updated: 2025-11-02
Version: v1.1
Status: Production Ready
Updated: 2025-11-30