NEXT_STEPS_ANALYSIS

makr-code edited this page Nov 18, 2025 · 1 revision

ThemisDB - Next Steps Analysis

Date: 17 November 2025 (updated after the AQL 100% sprint)
Basis: code analysis + todo list + implementation summary
Status after the AQL 100% sprint: 65% overall implementation


Executive Summary

With the AQL 100% sprint complete (Phase 1 done), the next logical steps are:

✅ COMPLETED:

  1. AQL Advanced Features: 100% COMPLETE (17.11.2025)
    • LET/Variable Bindings ✅
    • OR/NOT Operators ✅
    • Window Functions ✅
    • CTEs (WITH clause) ✅
    • Subqueries ✅
    • Advanced Aggregations ✅

🎯 Priority 1 (immediately, Q4 2025):

  1. Content Pipeline (30% → 80%, 1-2 weeks)
  2. Incremental Backups (0% → 90%, 1 week)
  3. Admin Tools MVP (27% → 70%, 2-3 weeks)

🎯 Priority 2 (Q1 2026):

  4. HSM/eIDAS PKI (docs available → production, 2 weeks)
  5. Security Hardening (45% → 80%, 2-3 weeks)


Sprint 1 Results (17.11.2025)

✅ AQL 100% - FULLY IMPLEMENTED

Commits: 5
Lines of code: +5,012
Tests: +70
Duration: 1 day

Implemented features:

  1. LET/Variable Bindings (608 lines, 25+ tests)

    • LetEvaluator class
    • Arithmetic operations (+, -, *, /, %)
    • String functions (CONCAT, SUBSTRING, UPPER, LOWER)
    • Math functions (ABS, MIN, MAX, CEIL, FLOOR, ROUND)
    • Nested field access (doc.address.city)
    • Array indexing (doc.tags[0])
    • Variable chaining (LET x = ..., LET y = x * 2)
  2. OR/NOT Operators (159 lines, 15+ tests)

    • De Morgan's Laws transformation
    • NOT (A OR B) = (NOT A) AND (NOT B)
    • NOT (A AND B) = (NOT A) OR (NOT B)
    • NEQ conversion: A != B = (A < B) OR (A > B)
    • Double negation elimination
    • Index merge for OR queries
  3. Window Functions (800+ lines, 20+ tests)

    • ROW_NUMBER(), RANK(), DENSE_RANK()
    • LAG(expr, offset), LEAD(expr, offset)
    • FIRST_VALUE(expr), LAST_VALUE(expr)
    • PARTITION BY (multi-column)
    • ORDER BY (multi-column, ASC/DESC)
    • Frame definitions (ROWS/RANGE BETWEEN ... AND ...)
  4. CTEs (WITH clause) (200+ lines)

    • Common Table Expressions
    • Temporary named result sets
    • Non-recursive CTEs (full stub)
    • Recursive CTEs (Phase 2 placeholder)
  5. Subqueries (200+ lines)

    • Scalar subqueries: (SELECT value)
    • IN subqueries: value IN (SELECT ...)
    • EXISTS/NOT EXISTS
    • Correlated subqueries (Phase 2 placeholder)
  6. Advanced Aggregations (300+ lines, 25+ tests)

    • PERCENTILE(expr, p), MEDIAN(expr)
    • STDDEV(expr), STDDEV_POP(expr)
    • VARIANCE(expr), VAR_POP(expr)
    • IQR(expr), MAD(expr), RANGE(expr)
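
The OR/NOT rewrite rules listed under item 2 (De Morgan's laws and double negation elimination) can be illustrated with a small NOT-pushdown pass. This is a minimal sketch, not the ThemisDB implementation: the `Expr` tree, `pushNot` and the helper constructors are hypothetical names, and comparison leaves are kept as opaque strings, so the NEQ conversion is left out.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical predicate tree; the real AQL AST types are not shown on this page.
struct Expr;
using ExprPtr = std::shared_ptr<const Expr>;

enum class Kind { Leaf, Not, And, Or };

struct Expr {
    Kind kind;
    std::string text;  // only used for Leaf, e.g. "A"
    ExprPtr lhs, rhs;  // Not uses lhs only
};

ExprPtr leaf(std::string t) {
    return std::make_shared<Expr>(Expr{Kind::Leaf, std::move(t), nullptr, nullptr});
}
ExprPtr mkNot(ExprPtr e) {
    return std::make_shared<Expr>(Expr{Kind::Not, "", std::move(e), nullptr});
}
ExprPtr mkAnd(ExprPtr a, ExprPtr b) {
    return std::make_shared<Expr>(Expr{Kind::And, "", std::move(a), std::move(b)});
}
ExprPtr mkOr(ExprPtr a, ExprPtr b) {
    return std::make_shared<Expr>(Expr{Kind::Or, "", std::move(a), std::move(b)});
}

// Pushes NOT down to the leaves. `negate` is true while an odd number of
// enclosing NOTs still has to be applied.
ExprPtr pushNot(const ExprPtr& e, bool negate = false) {
    assert(e && "expression must not be null");
    switch (e->kind) {
    case Kind::Not:
        return pushNot(e->lhs, !negate);  // double negation cancels out
    case Kind::And:
    case Kind::Or: {
        bool isAnd = (e->kind == Kind::And);
        if (negate) isAnd = !isAnd;       // De Morgan: swap AND and OR
        ExprPtr l = pushNot(e->lhs, negate);
        ExprPtr r = pushNot(e->rhs, negate);
        return isAnd ? mkAnd(l, r) : mkOr(l, r);
    }
    case Kind::Leaf:
        break;
    }
    return negate ? mkNot(e) : e;
}

// Renders the tree for inspection.
std::string toString(const ExprPtr& e) {
    switch (e->kind) {
    case Kind::Leaf: return e->text;
    case Kind::Not:  return "NOT " + toString(e->lhs);
    case Kind::And:  return "(" + toString(e->lhs) + " AND " + toString(e->rhs) + ")";
    case Kind::Or:   return "(" + toString(e->lhs) + " OR " + toString(e->rhs) + ")";
    }
    return "";
}
```

After the pass, NOT only appears directly in front of leaves, which is the shape an index-based planner can work with most easily: each OR branch can then be answered by its own index lookup.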

Detailed Analysis (Updated)

Implementation steps:

  1. LET Evaluator (4-6h)

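    // src/query/let_evaluator.cpp
    #include <string>
    #include <unordered_map>
    #include <nlohmann/json.hpp>

    class LetEvaluator {
        // Variable name -> value bound by an earlier LET
        std::unordered_map<std::string, nlohmann::json> bindings_;
    public:
        // Evaluates the LET expression against current_doc and stores the result
        void evaluateLet(const LetNode& node, const nlohmann::json& current_doc);
        // Returns the value bound to var_name
        nlohmann::json resolveVariable(const std::string& var_name);
    };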
  2. Integration in Query Engine (2-3h)

    • Add LET evaluator to query execution pipeline
    • Variable resolution in FILTER/RETURN expressions
  3. Tests (3-4h)

    • Unit tests: LET with arithmetic, string ops, nested objects
    • Integration tests: LET + FILTER, LET in Joins
    • Edge cases: Undefined variables, circular dependencies
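
The variable chaining and the undefined-variable edge case from the steps above can be sketched as follows. This is a simplified stand-in, not the actual `LetEvaluator`: the class name `LetChain` is hypothetical, values are plain `double` instead of `nlohmann::json`, and expressions are passed in as callables rather than parsed from AQL.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Simplified stand-in for a LET evaluator: values are doubles instead of JSON.
class LetChain {
public:
    using Lookup = std::function<double(const std::string&)>;
    using Expr = std::function<double(const Lookup&)>;

    // Registers `LET name = expr`; expr may read variables bound earlier.
    void let(std::string name, Expr expr) {
        assert(expr && "LET needs an expression");
        steps_.emplace_back(std::move(name), std::move(expr));
    }

    // Evaluates all LETs in declaration order and returns the bindings.
    std::unordered_map<std::string, double> evaluate() const {
        std::unordered_map<std::string, double> bindings;
        Lookup lookup = [&bindings](const std::string& var) {
            auto it = bindings.find(var);
            if (it == bindings.end())
                throw std::runtime_error("undefined variable: " + var);
            return it->second;
        };
        for (const auto& step : steps_) {
            // Evaluate first, bind afterwards, so a LET cannot see itself.
            const double value = step.second(lookup);
            bindings[step.first] = value;
        }
        return bindings;
    }

private:
    std::vector<std::pair<std::string, Expr>> steps_;
};
```

Because each LET only sees bindings that were evaluated before it, a circular dependency cannot loop forever in this design; it surfaces as an undefined-variable error on the first LET of the cycle.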

DoD:

  • ✅ LET bindings work in FOR/FILTER/RETURN
  • ✅ Multiple LETs per query
  • ✅ LETs can reference earlier LETs
  • ✅ 15+ tests PASSING

Files to change:

  • src/query/aql_translator.cpp - LET evaluation logic
  • src/query/query_engine.cpp - Variable resolution

1. Complete the Content Pipeline (HIGHEST PRIORITY)

Status: 30% implemented, base schema in place
Impact: RAG/hybrid-search workloads are blocked
Effort: 1-2 weeks

1.1 Advanced Extraction (PDF/DOCX/XLSX)

Code status:

// ✅ Text processor available (src/content/text_processor.cpp)
// ✅ Mock CLIP processor (src/content/mock_clip_processor.cpp)
// ❌ No real PDF/DOCX parsers

TODO markers in the code:

  • src/api/http_server.cpp:4 - "TODO: Implement in Phase 4, Task 11"
  • The content pipeline consists of mock-ups only

Implementation steps:

  1. PDF Extraction (6-8h)

    • Library: poppler-cpp or pdfium
    • Text + Metadata (author, created, pages)
    • Image extraction for multi-modal
  2. DOCX Extraction (4-6h)

    • Library: libxml2 (OpenXML parsing; the .docx container is a ZIP archive and has to be unpacked first)
    • Text + Styles + Metadata
  3. XLSX Extraction (4-6h)

    • Library: xlnt or libxlsx
    • Tables → JSON/CSV
  4. Tests (4-5h)

    • Real-world PDFs (100+ pages)
    • Complex DOCX (images, tables, formulas)
    • Large XLSX (10k rows)
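
One way to plug the three extractors into a common entry point is a registry keyed by file extension. The sketch below is an assumption about how this could look, not existing ThemisDB code: `ExtractedContent`, `Extractor` and `ExtractorRegistry` are hypothetical names, and the real parsers (poppler, libxml2, xlnt) would sit behind the registered callables.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical result type; the field names are illustrative only.
struct ExtractedContent {
    std::string text;
    std::map<std::string, std::string> metadata;  // e.g. author, created, pages
};

using Extractor = std::function<ExtractedContent(const std::string& path)>;

class ExtractorRegistry {
public:
    // Registers an extractor for a file extension such as "pdf" (case-insensitive).
    void add(const std::string& extension, Extractor fn) {
        assert(fn && "extractor must be callable");
        extractors_[lower(extension)] = std::move(fn);
    }

    // Runs the matching extractor. Returns false for unknown file types so the
    // caller can report them instead of silently storing empty content.
    bool extract(const std::string& path, ExtractedContent& out) const {
        const std::string::size_type dot = path.find_last_of('.');
        if (dot == std::string::npos) return false;
        auto it = extractors_.find(lower(path.substr(dot + 1)));
        if (it == extractors_.end()) return false;
        out = it->second(path);
        return true;
    }

private:
    static std::string lower(std::string s) {
        std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
            return static_cast<char>(std::tolower(c));
        });
        return s;
    }

    std::map<std::string, Extractor> extractors_;
};
```

Errors from corrupted files are not modelled here; an extractor could throw, or the result type could carry an error field, depending on how ContentManager reports failures.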

DoD:

  • ✅ PDF/DOCX/XLSX extraction works
  • ✅ Metadata preservation
  • ✅ Error handling for corrupted files
  • ✅ Integration with ContentManager

Files to change:

  • src/content/pdf_processor.cpp - NEW
  • src/content/docx_processor.cpp - NEW
  • src/content/xlsx_processor.cpp - NEW
  • CMakeLists.txt - Add poppler/libxml2/xlnt
  • vcpkg.json - Add dependencies

1.2 Chunking Optimization

Code status:

// ⚠️ Basic chunking available
// ❌ No semantic chunking strategies

Implementation steps:

  1. Semantic Chunking (6-8h)

    • Sentence-level chunking (NLTK/spaCy)
    • Paragraph-preserving chunking
    • Sliding window with overlap
  2. Chunk Metadata (3-4h)

    • Position tracking (start_offset, end_offset)
    • Parent-child relationships
    • Chunk embeddings
  3. Batch Upload Optimization (4-6h)

    • Parallel chunk processing (Intel TBB)
    • RocksDB WriteBatch for bulk inserts
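
The sliding window with overlap and the position tracking (start_offset, end_offset) from the steps above can be sketched as a fixed-size chunker. This is a minimal sketch under stated assumptions: `Chunk` and `chunkSlidingWindow` are hypothetical names, and offsets are byte offsets, so a window boundary can fall inside a multi-byte UTF-8 character.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Chunk {
    std::size_t start_offset;  // inclusive byte offset into the source text
    std::size_t end_offset;    // exclusive byte offset
    std::string text;
};

// Fixed-size sliding window; consecutive chunks share `overlap` bytes.
std::vector<Chunk> chunkSlidingWindow(const std::string& text,
                                      std::size_t window, std::size_t overlap) {
    assert(window > 0 && overlap < window);
    std::vector<Chunk> chunks;
    const std::size_t step = window - overlap;
    for (std::size_t start = 0; start < text.size(); start += step) {
        const std::size_t end = std::min(start + window, text.size());
        chunks.push_back({start, end, text.substr(start, end - start)});
        if (end == text.size()) break;  // the last window reached the end
    }
    return chunks;
}
```

Sentence-level and paragraph-preserving strategies would replace the fixed `step` with boundaries found in the text, but can keep the same `Chunk` shape so that downstream metadata handling stays identical.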

DoD:

  • ✅ 3 chunking strategies (fixed-size, sentence, paragraph)
  • ✅ Chunk metadata complete
  • ✅ 10x faster bulk upload
  • ✅ Tests PASSING

Files to change:

  • src/content/chunking_strategy.cpp - NEW
  • src/content/content_manager.cpp - Batch optimization
  • tests/test_chunking.cpp - NEW

3. Admin Tools MVP (MEDIUM)

Status: 27% implemented (only AuditLogViewer is production-ready)
Impact: Operations, compliance, GDPR
Effort: 2-3 weeks

3.1 Tool Status Audit

Current tools (WPF .NET 8):

Tool | Code Status | Backend API | Tests | %
AuditLogViewer | ✅ Implemented | /audit/logs | | 90%
SAGAVerifier | ✅ Implemented | /saga/batches | ⚠️ Minimal | 70%
PIIManager | ✅ Implemented | /pii/* | ⚠️ Minimal | 60%
KeyRotationDashboard | ✅ MVP (demo data) | /keys/* | | 40%
RetentionManager | ✅ MVP (demo data) | ⚠️ Partial | | 30%
ClassificationDashboard | ✅ MVP (demo data) | /classification/* | | 40%
ComplianceReports | ✅ MVP (demo data) | /reports/* | | 40%

Overall: 27% (weighted down heavily by missing tests and missing real backend integration; the unweighted mean of the % column is about 53%)

3.2 Critical Gaps

Missing backend APIs (status per route):

  • /pii/* - PRESENT (implemented in Critical Sprint)
  • /keys/* - PRESENT
  • /classification/* - PRESENT
  • ⚠️ /retention/* - PARTIAL (ContinuousAggregateManager exists, no HTTP endpoint)
  • /reports/* - PRESENT

Action Items:

  1. Retention API Endpoint (4-6h)

    // src/server/http_server.cpp
    CROW_ROUTE(app, "/api/retention/policies").methods("GET"_method)
    CROW_ROUTE(app, "/api/retention/policies").methods("POST"_method)
    CROW_ROUTE(app, "/api/retention/execute").methods("POST"_method)
  2. Integration Tests (8-10h)

    • E2E tests for each tool
    • Mock Backend → Real Backend migration
  3. Deployment Scripts (3-4h)

    • MSI Installer (WiX Toolset)
    • Auto-Update mechanism

DoD:

  • ✅ All 7 tools connected to the live backend
  • ✅ Integration tests PASSING
  • ✅ Deployment-ready MSI

Files to change:

  • src/server/http_server.cpp - Retention endpoints
  • tools/*/ViewModels/*.cs - Remove mock data
  • tools/deployment/build.ps1 - NEW

4. Incremental Backups (CRITICAL for Production)

Status: 0% implemented (RocksDB checkpoints only)
Impact: Data loss prevention, disaster recovery
Effort: 1 week

4.1 WAL Archiving

Code status:

// ✅ RocksDB checkpoints implemented
// ❌ No WAL archiving
// ❌ No point-in-time recovery

TODO markers:

  • docs/development/todo.md:60 - "Inkrementelle Backups / WAL-Archiving — TODO"

Implementation steps:

  1. WAL Archive Manager (8-10h)

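    // Needs <cstdint>, <string>, <vector>; WALFile describes one archived WAL file
    class WALArchiveManager {
    public:
        // Copies a WAL file into the archive location
        void archiveWAL(const std::string& wal_file, const std::string& archive_path);
        // Replays archived WAL files up to target_timestamp
        void restoreFromWAL(const std::string& archive_path, uint64_t target_timestamp);
        // Lists the WAL files currently held in the archive
        std::vector<WALFile> listArchivedWALs();
    };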
  2. Incremental Backup (6-8h)

    • Copy only changed WAL files since last backup
    • Manifest file (backup_manifest.json) with timestamps
  3. Point-in-Time Recovery (8-10h)

    • Restore checkpoint + replay WAL files until target timestamp
    • Verify data integrity after recovery
  4. Automated Backup Jobs (4-6h)

    • Cron-style scheduler (every 6h, daily, weekly)
    • Retention policy (keep last 7 dailies, 4 weeklies, 12 monthlies)
  5. Cloud Storage Integration (6-8h)

    • S3 upload via aws-sdk-cpp
    • Azure Blob Storage via azure-storage-cpp
    • Google Cloud Storage via google-cloud-cpp
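
For the point-in-time recovery step (restore a checkpoint, then replay WAL files until the target timestamp), the selection of WAL files could look roughly like this. It is a sketch under assumptions: the page does not define `WALFile`, so the fields below are hypothetical, and `selectWALsForRecovery` is an illustrative helper rather than part of `WALArchiveManager`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical shape of one archived WAL file.
struct WALFile {
    std::string path;
    std::uint64_t first_timestamp;  // timestamp of the first record in the file
    std::uint64_t last_timestamp;   // timestamp of the last record in the file
};

// Returns the WAL files to replay on top of a checkpoint taken at
// checkpoint_ts in order to reach target_ts, oldest first.
std::vector<WALFile> selectWALsForRecovery(std::vector<WALFile> archived,
                                           std::uint64_t checkpoint_ts,
                                           std::uint64_t target_ts) {
    assert(checkpoint_ts <= target_ts);
    std::vector<WALFile> plan;
    for (WALFile& wal : archived) {
        const bool already_in_checkpoint = wal.last_timestamp <= checkpoint_ts;
        const bool starts_after_target = wal.first_timestamp > target_ts;
        if (!already_in_checkpoint && !starts_after_target)
            plan.push_back(std::move(wal));
    }
    std::sort(plan.begin(), plan.end(), [](const WALFile& a, const WALFile& b) {
        return a.first_timestamp < b.first_timestamp;
    });
    return plan;
}
```

The last file in the plan may still contain records newer than the target, so the replay itself has to stop at record level. The same filter with `target_ts` set to "now" yields the files an incremental backup would need to copy since the last backup.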

DoD:

  • ✅ Incremental backups work
  • ✅ Point-in-Time Recovery tested
  • ✅ S3/Azure/GCS upload
  • ✅ Automated schedules
  • ✅ Restore tests PASSING

Files to change:

  • include/backup/wal_archive_manager.h - NEW
  • src/backup/wal_archive_manager.cpp - NEW
  • src/backup/backup_scheduler.cpp - NEW
  • src/server/http_server.cpp - Backup endpoints
  • tests/test_backup_restore.cpp - NEW

5. HSM/eIDAS PKI Production-Ready (HIGH)

Status: Docs available (1,111 lines), no HSM integration
Impact: Qualified eIDAS signatures for production
Effort: 2 weeks

5.1 Vault HSM Integration

Code status:

// ✅ VaultKeyProvider available (src/security/vault_key_provider.cpp)
// ✅ PKIClient available (src/security/vcc_pki_client.cpp)
// ❌ No HSM integration

TODO markers:

  • src/security/vcc_pki_client.cpp:348 - "TODO: Implement full X.509 chain validation"
  • docs/development/todo.md:60 - "eIDAS-konforme Signaturen / PKI Integration (Produktiv-Ready mit HSM) — TODO"

Implementation steps:

  1. Vault Transit Engine (6-8h)

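    class VaultHSMProvider : public PKIClient {
    public:
        // Signs data with an HSM-backed key through the Vault Transit engine:
        //   POST /v1/transit/sign/my-key
        // Declaration only; the body is still to be implemented.
        std::string sign(const std::string& data) override;
    };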
  2. X.509 Chain Validation (4-6h)

    • OpenSSL X509_verify_cert()
    • CRL checking
    • OCSP validation
  3. Qualified Timestamp Authority (6-8h)

    • RFC 3161 timestamp requests
    • Timestamp verification
    • Integration with SAGA events
  4. eIDAS Compliance Tests (8-10h)

    • Qualified signature validation
    • Timestamp validation
    • Full audit trail test

DoD:

  • ✅ Vault Transit Engine integration
  • ✅ X.509 chain validation
  • ✅ Qualified TSA integration
  • ✅ eIDAS compliance validated
  • ✅ Production deployment guide

Files to change:

  • src/security/vault_hsm_provider.cpp - NEW
  • src/security/vcc_pki_client.cpp - X.509 validation
  • src/utils/timestamp_authority.cpp - NEW
  • tests/test_eidas_compliance.cpp - NEW

Priority Matrix

Task | Business Value | Technical Complexity | Effort | Priority
LET/Subqueries | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | 2-3 days | P0
OR/NOT Index-Merge | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 3-4 days | P0
PDF/DOCX Extraction | ⭐⭐⭐⭐ | ⭐⭐⭐ | 2-3 days | P1
Incremental Backups | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 5-7 days | P1
Admin Tools Integration | ⭐⭐⭐ | ⭐⭐ | 3-4 days | P2
Hash-Join | ⭐⭐⭐ | ⭐⭐⭐⭐ | 4-5 days | P2
HSM/eIDAS | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 10-12 days | P2
Chunking Optimization | ⭐⭐⭐ | ⭐⭐ | 2-3 days | P3

Recommended Roadmap

Sprint 1 (Week 1-2): AQL Advanced Features

Goal: AQL from 65% to 85%

  • Day 1-3: Implement LET/subqueries + tests
  • Day 4-7: OR/NOT with index merge
  • Day 8-10: Advanced joins (hash-join basis)

Deliverable: AQL production-ready for complex queries


Sprint 2 (Week 3-4): Content Pipeline + Backups

Goal: Content 30% → 60%, Backups 0% → 90%

  • Day 1-4: PDF/DOCX/XLSX Extraction
  • Day 5-6: Chunking Optimization
  • Day 7-10: WAL-Archiving + Point-in-Time Recovery

Deliverable: RAG-Ready Content Pipeline, Production Backups


Sprint 3 (Week 5-6): Admin Tools + HSM

Goal: Admin Tools 27% → 70%, HSM integration

  • Day 1-4: Admin Tools Backend-Integration + Tests
  • Day 5-10: Vault HSM + eIDAS Compliance

Deliverable: Operations-Ready Admin Suite, Qualified Signatures


Prioritized Code TODOs

CRITICAL (Sprint 1)

  1. src/query/aql_translator.cpp:31 - LET execution
  2. src/query/query_optimizer.cpp - OR cost model
  3. src/index/secondary_index.cpp - Index merge utilities
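
The index merge utilities for OR queries could, in their simplest form, be a union of the sorted key lists returned by each OR branch. The sketch below assumes that every index lookup yields primary keys as sorted, duplicate-free strings; `indexMergeOr` is a hypothetical name and not an existing function in secondary_index.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Merges the primary-key lists produced by the two branches of `A OR B`.
// Both inputs must be sorted ascending and free of duplicates; the result is
// then sorted and duplicate-free as well.
std::vector<std::string> indexMergeOr(const std::vector<std::string>& left,
                                      const std::vector<std::string>& right) {
    assert(std::is_sorted(left.begin(), left.end()));
    assert(std::is_sorted(right.begin(), right.end()));
    std::vector<std::string> merged;
    merged.reserve(left.size() + right.size());
    std::set_union(left.begin(), left.end(), right.begin(), right.end(),
                   std::back_inserter(merged));
    return merged;
}
```

The merge is linear in the combined input size, which gives the OR cost model in the query optimizer a simple starting point: the cost of both index scans plus one pass over their results.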

HIGH (Sprint 2)

  1. src/content/pdf_processor.cpp - NEW (PDF extraction)
  2. src/backup/wal_archive_manager.cpp - NEW (WAL archiving)
  3. src/server/http_server.cpp - Retention endpoints

MEDIUM (Sprint 3)

  1. src/security/vault_hsm_provider.cpp - NEW (HSM integration)
  2. src/security/vcc_pki_client.cpp:348 - X.509 validation
  3. tools/*/ViewModels/*.cs - Remove mock data

Success Metrics

Sprint 1 Goals:

  • ✅ AQL: 85% implementation (up from 65%)
  • ✅ LET: 15+ tests PASSING
  • ✅ OR: 20+ tests PASSING
  • ✅ Hash-Join: 10x speedup on large joins

Sprint 2 Goals:

  • ✅ Content: 60% implementation (up from 30%)
  • ✅ PDF/DOCX: Real-world extraction works
  • ✅ Backups: Point-in-Time Recovery validated
  • ✅ Automated backup jobs running

Sprint 3 Goals:

  • ✅ Admin Tools: 70% implementation (up from 27%)
  • ✅ All 7 tools with live backends
  • ✅ HSM: Vault Transit Engine integrated
  • ✅ eIDAS: Qualified signatures validated

Overall target: 70% overall implementation (up from 61%)


Dependencies

External libraries to install:

  • poppler-cpp (PDF extraction)
  • libxml2 (DOCX extraction)
  • xlnt (XLSX extraction)
  • aws-sdk-cpp (S3 backups)
  • azure-storage-cpp (Azure backups)
  • google-cloud-cpp (GCS backups)

vcpkg.json updates:

{
  "dependencies": [
    "poppler",
    "libxml2",
    "xlnt",
    { "name": "aws-sdk-cpp", "features": ["s3"] },
    "azure-storage-cpp",
    { "name": "google-cloud-cpp", "features": ["storage"] }
  ]
}

Risks & Mitigations

Risk | Impact | Likelihood | Mitigation
LET implementation is complex | HIGH | MEDIUM | Start with simple expressions, extend step by step
Index-merge performance | MEDIUM | LOW | Run benchmarks in parallel with development
PDF library integration | MEDIUM | MEDIUM | POC with poppler before full integration
HSM costs | HIGH | LOW | Dev environment with mock HSM, production tests separately
Backup storage costs | MEDIUM | MEDIUM | Implement retention policies (auto-delete old backups)

Conclusion

Recommended next steps (in order):

  1. NOW: LET/Subqueries (3 days) - BLOCKER for production
  2. THEN: OR/NOT Index-Merge (4 days) - BLOCKER for complex queries
  3. IN PARALLEL: Incremental Backups (5 days) - CRITICAL for production
  4. AFTERWARDS: Content Pipeline (3 days) - enables RAG
  5. LATER: Admin Tools + HSM (2 weeks) - Operations Excellence

Total effort: ~6 weeks for all P0/P1 tasks
Expected outcome: 70% overall implementation, production-ready AQL, Operations Excellence

Updated: 2025-11-30
