themis docs exporters JSONL_LLM_EXPORTER

JSONL LLM Exporter - LoRA/QLoRA Training Data Export

Overview

The JSONL LLM Exporter exports ThemisDB BaseEntity data as weighted training samples in JSONL format for fine-tuning Large Language Models with LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA).

Key Features

✅ Multiple LLM Formats

Instruction Tuning ({"instruction": ..., "input": ..., "output": ...})
Chat Completion ({"messages": [{"role": ..., "content": ...}]})
Text Completion ({"text": ...})

✅ Weighted Training Samples

Explicit weight field (e.g., importance: 0.8)
Auto-weighting by text length
Auto-weighting by data freshness
Custom weighting strategies

✅ Quality Filtering

Min/max text length constraints
Empty output detection
Duplicate detection
Configurable quality thresholds

✅ Metadata Enrichment

Source tracking
Category/tag preservation
Custom metadata fields

Installation

As Plugin

# Load via PluginManager
auto& pm = PluginManager::instance();
pm.scanPluginDirectory("./plugins");
auto* plugin = pm.loadPlugin("jsonl_llm_exporter");
auto* exporter = static_cast<IExporter*>(plugin->getInstance());

Direct Usage

#include "exporters/jsonl_llm_exporter.h"

JSONLLLMConfig config;
config.style = JSONLFormat::Style::INSTRUCTION_TUNING;
config.weighting.enable_weights = true;
config.weighting.auto_weight_by_length = true;

JSONLLLMExporter exporter(config);

Configuration

Instruction Tuning Format

Best for question-answering, task completion:

JSONLLLMConfig config;
config.style = JSONLFormat::Style::INSTRUCTION_TUNING;
config.field_mapping.instruction_field = "question";
config.field_mapping.input_field = "context";
config.field_mapping.output_field = "answer";

BaseEntity Example:

{
  "pk": "qa_001",
  "question": "What is the capital of France?",
  "context": "France is a country in Western Europe",
  "answer": "Paris is the capital of France.",
  "importance": 0.9
}

JSONL Output:

{"instruction": "What is the capital of France?", "input": "France is a country in Western Europe", "output": "Paris is the capital of France.", "weight": 0.9}

Chat Completion Format

Best for conversational AI:

JSONLLLMConfig config;
config.style = JSONLFormat::Style::CHAT_COMPLETION;
config.field_mapping.system_field = "system_prompt";
config.field_mapping.user_field = "user_message";
config.field_mapping.assistant_field = "assistant_response";

BaseEntity Example:

{
  "pk": "chat_001",
  "system_prompt": "You are a helpful assistant.",
  "user_message": "Explain quantum computing",
  "assistant_response": "Quantum computing uses quantum bits...",
  "importance": 1.2
}

JSONL Output:

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Explain quantum computing"}, {"role": "assistant", "content": "Quantum computing uses quantum bits..."}], "weight": 1.2}

Text Completion Format

Best for text generation, next-word prediction:

JSONLLLMConfig config;
config.style = JSONLFormat::Style::TEXT_COMPLETION;
config.field_mapping.text_field = "content";

Weighting Strategies

1. Explicit Weights

config.weighting.enable_weights = true;
config.weighting.weight_field = "importance";  // Field in BaseEntity
config.weighting.default_weight = 1.0;         // If field missing

Use Case: Domain experts manually assign importance scores.

2. Auto-Weight by Length

config.weighting.auto_weight_by_length = true;

Formula: weight *= (1.0 + min(0.5, length / 2000.0))

Use Case: Longer, more detailed responses get higher weights (up to 1.5x).

3. Auto-Weight by Freshness

config.weighting.auto_weight_by_freshness = true;
config.weighting.timestamp_field = "created_at";

Use Case: Newer data is more valuable (recent trends, updated information).

4. Combined Strategies

config.weighting.enable_weights = true;
config.weighting.auto_weight_by_length = true;
config.weighting.auto_weight_by_freshness = true;

Weights are multiplied: final_weight = explicit_weight × length_factor × freshness_factor

Quality Filtering

Length Constraints

config.quality.min_text_length = 50;      // Skip very short responses
config.quality.max_text_length = 8192;    // Skip excessively long responses

Empty Output Detection

config.quality.skip_empty_outputs = true;  // Skip if output field is empty

Duplicate Detection

config.quality.skip_duplicates = true;  // Hash-based duplicate removal

Metadata Enrichment

config.include_metadata = true;
config.metadata_fields = {"source", "category", "tags", "author"};

Output with metadata:

{"instruction": "...", "output": "...", "weight": 1.0, "metadata": {"source": "wikipedia", "category": "science", "tags": ["physics", "quantum"]}}

Usage Examples

Example 1: Export FAQ Database for LoRA Training

// Load entities from ThemisDB
std::vector<BaseEntity> faqs = db.query("category=faq");

// Configure exporter
JSONLLLMConfig config;
config.style = JSONLFormat::Style::INSTRUCTION_TUNING;
config.field_mapping.instruction_field = "question";
config.field_mapping.output_field = "answer";
config.weighting.enable_weights = true;
config.weighting.weight_field = "upvotes";  // Use upvotes as weights

JSONLLLMExporter exporter(config);

// Export
ExportOptions options;
options.output_path = "training_data/faq_lora.jsonl";
options.progress_callback = [](const ExportStats& stats) {
    std::cout << "Exported: " << stats.exported_entities << " entities\n";
};

auto stats = exporter.exportEntities(faqs, options);
std::cout << stats.toJson() << std::endl;

Example 2: Export Chat Logs for QLoRA

// Load chat conversations
std::vector<BaseEntity> chats = db.query("type=conversation AND rating>4");

// Configure for chat format
JSONLLLMConfig config;
config.style = JSONLFormat::Style::CHAT_COMPLETION;
config.field_mapping.user_field = "user_query";
config.field_mapping.assistant_field = "bot_response";
config.weighting.auto_weight_by_length = true;  // Detailed responses weighted higher
config.quality.min_text_length = 100;           // Skip short exchanges

JSONLLLMExporter exporter(config);

// Export for QLoRA training
ExportOptions options;
options.output_path = "training_data/chat_qlora.jsonl";

auto stats = exporter.exportEntities(chats, options);

Example 3: Export Knowledge Base with Freshness Weighting

// Load recent knowledge articles
std::vector<BaseEntity> articles = db.query("type=article");

// Prioritize recent content
JSONLLLMConfig config;
config.style = JSONLFormat::Style::TEXT_COMPLETION;
config.field_mapping.text_field = "full_text";
config.weighting.auto_weight_by_freshness = true;
config.weighting.timestamp_field = "published_date";
config.include_metadata = true;
config.metadata_fields = {"author", "topic", "published_date"};

JSONLLLMExporter exporter(config);

ExportOptions options;
options.output_path = "training_data/kb_weighted.jsonl";

auto stats = exporter.exportEntities(articles, options);

Training with Exported Data

LoRA Training (HuggingFace PEFT)

from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

# Load exported JSONL
dataset = load_dataset("json", data_files="faq_lora.jsonl")

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Load base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b")
model = get_peft_model(model, lora_config)

# Use weights from JSONL
def compute_loss(model, inputs, weights):
    outputs = model(**inputs)
    loss = outputs.loss
    return (loss * weights).mean()  # Weight by importance

# Train
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()

QLoRA Training (bitsandbytes)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b",
    quantization_config=bnb_config,
    device_map="auto"
)

# Apply LoRA on quantized model
from peft import prepare_model_for_kbit_training, LoraConfig

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

# Train with weighted samples from JSONL
# (Same as above)

Output Statistics

{
  "total_entities": 10000,
  "exported_entities": 9500,
  "failed_entities": 500,
  "bytes_written": 15728640,
  "duration_ms": 2300,
  "errors": [
    "Entity qa_123: Missing required field 'output'",
    "Entity qa_456: Text too short (5 chars)"
  ]
}

Limitations

No streaming: Entire entity set loaded in memory
Single file output: No sharding for very large datasets
Fixed field mappings: Custom transformations require code changes

Planned Enhancements (v2.0)

Streaming export for large datasets
Automatic dataset sharding
Data augmentation (paraphrasing, back-translation)
Multi-turn conversation support
Token counting for optimal batch sizes
Integration with HuggingFace Hub

Wiki Sidebar Umstrukturierung

Datum: 2025-11-30
Status: ✅ Abgeschlossen
Commit: bc7556a

Zusammenfassung

Die Wiki-Sidebar wurde umfassend überarbeitet, um alle wichtigen Dokumente und Features der ThemisDB vollständig zu repräsentieren.

Ausgangslage

Vorher:

64 Links in 17 Kategorien
Dokumentationsabdeckung: 17.7% (64 von 361 Dateien)
Fehlende Kategorien: Reports, Sharding, Compliance, Exporters, Importers, Plugins u.v.m.
src/ Dokumentation: nur 4 von 95 Dateien verlinkt (95.8% fehlend)
development/ Dokumentation: nur 4 von 38 Dateien verlinkt (89.5% fehlend)

Dokumentenverteilung im Repository:

Kategorie        Dateien  Anteil
-----------------------------------------
src                 95    26.3%
root                41    11.4%
development         38    10.5%
reports             36    10.0%
security            33     9.1%
features            30     8.3%
guides              12     3.3%
performance         12     3.3%
architecture        10     2.8%
aql                 10     2.8%
[...25 weitere]     44    12.2%
-----------------------------------------
Gesamt             361   100.0%

Neue Struktur

Nachher:

171 Links in 25 Kategorien
Dokumentationsabdeckung: 47.4% (171 von 361 Dateien)
Verbesserung: +167% mehr Links (+107 Links)
Alle wichtigen Kategorien vollständig repräsentiert

Kategorien (25 Sektionen)

1. Core Navigation (4 Links)

Home, Features Overview, Quick Reference, Documentation Index

2. Getting Started (4 Links)

Build Guide, Architecture, Deployment, Operations Runbook

3. SDKs and Clients (5 Links)

JavaScript, Python, Rust SDK + Implementation Status + Language Analysis

4. Query Language / AQL (8 Links)

Overview, Syntax, EXPLAIN/PROFILE, Hybrid Queries, Pattern Matching
Subqueries, Fulltext Release Notes

5. Search and Retrieval (8 Links)

Hybrid Search, Fulltext API, Content Search, Pagination
Stemming, Fusion API, Performance Tuning, Migration Guide

6. Storage and Indexes (10 Links)

Storage Overview, RocksDB Layout, Geo Schema
Index Types, Statistics, Backup, HNSW Persistence
Vector/Graph/Secondary Index Implementation

7. Security and Compliance (17 Links)

Overview, RBAC, TLS, Certificate Pinning
Encryption (Strategy, Column, Key Management, Rotation)
HSM/PKI/eIDAS Integration
PII Detection/API, Threat Model, Hardening, Incident Response, SBOM

8. Enterprise Features (6 Links)

Overview, Scalability Features/Strategy
HTTP Client Pool, Build Guide, Enterprise Ingestion

9. Performance and Optimization (10 Links)

Benchmarks (Overview, Compression), Compression Strategy
Memory Tuning, Hardware Acceleration, GPU Plans
CUDA/Vulkan Backends, Multi-CPU, TBB Integration

10. Features and Capabilities (13 Links)

Time Series, Vector Ops, Graph Features
Temporal Graphs, Path Constraints, Recursive Queries
Audit Logging, CDC, Transactions
Semantic Cache, Cursor Pagination, Compliance, GNN Embeddings

11. Geo and Spatial (7 Links)

Overview, Architecture, 3D Game Acceleration
Feature Tiering, G3 Phase 2, G5 Implementation, Integration Guide

12. Content and Ingestion (9 Links)

Content Architecture, Pipeline, Manager
JSON Ingestion, Filesystem API
Image/Geo Processors, Policy Implementation

13. Sharding and Scaling (5 Links)

Overview, Horizontal Scaling Strategy
Phase Reports, Implementation Summary

14. APIs and Integration (5 Links)

OpenAPI, Hybrid Search API, ContentFS API
HTTP Server, REST API

15. Admin Tools (5 Links)

Admin/User Guides, Feature Matrix
Search/Sort/Filter, Demo Script

16. Observability (3 Links)

Metrics Overview, Prometheus, Tracing

17. Development (11 Links)

Developer Guide, Implementation Status, Roadmap
Build Strategy/Acceleration, Code Quality
AQL LET, Audit/SAGA API, PKI eIDAS, WAL Archiving

18. Architecture (7 Links)

Overview, Strategic, Ecosystem
MVCC Design, Base Entity
Caching Strategy/Data Structures

19. Deployment and Operations (8 Links)

Docker Build/Status, Multi-Arch CI/CD
ARM Build/Packages, Raspberry Pi Tuning
Packaging Guide, Package Maintainers

20. Exporters and Integrations (4 Links)

JSONL LLM Exporter, LoRA Adapter Metadata
vLLM Multi-LoRA, Postgres Importer

21. Reports and Status (9 Links)

Roadmap, Changelog, Database Capabilities
Implementation Summary, Sachstandsbericht 2025
Enterprise Final Report, Test/Build Reports, Integration Analysis

22. Compliance and Governance (6 Links)

BCP/DRP, DPIA, Risk Register
Vendor Assessment, Compliance Dashboard/Strategy

23. Testing and Quality (3 Links)

Quality Assurance, Known Issues
Content Features Test Report

24. Source Code Documentation (8 Links)

Source Overview, API/Query/Storage/Security/CDC/TimeSeries/Utils Implementation

25. Reference (3 Links)

Glossary, Style Guide, Publishing Guide

Verbesserungen

Quantitative Metriken

Metrik	Vorher	Nachher	Verbesserung
Anzahl Links	64	171	+167% (+107)
Kategorien	17	25	+47% (+8)
Dokumentationsabdeckung	17.7%	47.4%	+167% (+29.7pp)

Qualitative Verbesserungen

Neu hinzugefügte Kategorien:

✅ Reports and Status (9 Links) - vorher 0%
✅ Compliance and Governance (6 Links) - vorher 0%
✅ Sharding and Scaling (5 Links) - vorher 0%
✅ Exporters and Integrations (4 Links) - vorher 0%
✅ Testing and Quality (3 Links) - vorher 0%
✅ Content and Ingestion (9 Links) - deutlich erweitert
✅ Deployment and Operations (8 Links) - deutlich erweitert
✅ Source Code Documentation (8 Links) - deutlich erweitert

Stark erweiterte Kategorien:

Security: 6 → 17 Links (+183%)
Storage: 4 → 10 Links (+150%)
Performance: 4 → 10 Links (+150%)
Features: 5 → 13 Links (+160%)
Development: 4 → 11 Links (+175%)

Struktur-Prinzipien

1. User Journey Orientierung

Getting Started → Using ThemisDB → Developing → Operating → Reference
     ↓                ↓                ↓            ↓           ↓
 Build Guide    Query Language    Development   Deployment  Glossary
 Architecture   Search/APIs       Architecture  Operations  Guides
 SDKs           Features          Source Code   Observab.

2. Priorisierung nach Wichtigkeit

Tier 1: Quick Access (4 Links) - Home, Features, Quick Ref, Docs Index
Tier 2: Frequently Used (50+ Links) - AQL, Search, Security, Features
Tier 3: Technical Details (100+ Links) - Implementation, Source Code, Reports

3. Vollständigkeit ohne Überfrachtung

Alle 35 Kategorien des Repositorys vertreten
Fokus auf wichtigste 3-8 Dokumente pro Kategorie
Balance zwischen Übersicht und Details

4. Konsistente Benennung

Klare, beschreibende Titel
Keine Emojis (PowerShell-Kompatibilität)
Einheitliche Formatierung

Technische Umsetzung

Implementierung

Datei: sync-wiki.ps1 (Zeilen 105-359)
Format: PowerShell Array mit Wiki-Links
Syntax: [[Display Title|pagename]]
Encoding: UTF-8

Deployment

# Automatische Synchronisierung via:
.\sync-wiki.ps1

# Prozess:
# 1. Wiki Repository klonen
# 2. Markdown-Dateien synchronisieren (412 Dateien)
# 3. Sidebar generieren (171 Links)
# 4. Commit & Push zum GitHub Wiki

Qualitätssicherung

✅ Alle Links syntaktisch korrekt
✅ Wiki-Link-Format [[Title|page]] verwendet
✅ Keine PowerShell-Syntaxfehler (& Zeichen escaped)
✅ Keine Emojis (UTF-8 Kompatibilität)
✅ Automatisches Datum-Timestamp

Ergebnis

GitHub Wiki URL: https://github.com/makr-code/ThemisDB/wiki

Commit Details

Hash: bc7556a
Message: "Auto-sync documentation from docs/ (2025-11-30 13:09)"
Änderungen: 1 file changed, 186 insertions(+), 56 deletions(-)
Netto: +130 Zeilen (neue Links)

Abdeckung nach Kategorie

Kategorie	Repository Dateien	Sidebar Links	Abdeckung
src	95	8	8.4%
security	33	17	51.5%
features	30	13	43.3%
development	38	11	28.9%
performance	12	10	83.3%
aql	10	8	80.0%
search	9	8	88.9%
geo	8	7	87.5%
reports	36	9	25.0%
architecture	10	7	70.0%
sharding	5	5	100.0% ✅
clients	6	5	83.3%

Durchschnittliche Abdeckung: 47.4%

Kategorien mit 100% Abdeckung: Sharding (5/5)

Kategorien mit >80% Abdeckung:

Sharding (100%), Search (88.9%), Geo (87.5%), Clients (83.3%), Performance (83.3%), AQL (80%)

Nächste Schritte

Kurzfristig (Optional)

Weitere wichtige Source Code Dateien verlinken (aktuell nur 8 von 95)
Wichtigste Reports direkt verlinken (aktuell nur 9 von 36)
Development Guides erweitern (aktuell 11 von 38)

Mittelfristig

Sidebar automatisch aus DOCUMENTATION_INDEX.md generieren
Kategorien-Unterkategorien-Hierarchie implementieren
Dynamische "Most Viewed" / "Recently Updated" Sektion

Langfristig

Vollständige Dokumentationsabdeckung (100%)
Automatische Link-Validierung (tote Links erkennen)
Mehrsprachige Sidebar (EN/DE)

Lessons Learned

Emojis vermeiden: PowerShell 5.1 hat Probleme mit UTF-8 Emojis in String-Literalen
Ampersand escapen: & muss in doppelten Anführungszeichen stehen
Balance wichtig: 171 Links sind übersichtlich, 361 wären zu viel
Priorisierung kritisch: Wichtigste 3-8 Docs pro Kategorie reichen für gute Abdeckung
Automatisierung wichtig: sync-wiki.ps1 ermöglicht schnelle Updates

Fazit

Die Wiki-Sidebar wurde erfolgreich von 64 auf 171 Links (+167%) erweitert und repräsentiert nun alle wichtigen Bereiche der ThemisDB:

✅ Vollständigkeit: Alle 35 Kategorien vertreten
✅ Übersichtlichkeit: 25 klar strukturierte Sektionen
✅ Zugänglichkeit: 47.4% Dokumentationsabdeckung
✅ Qualität: Keine toten Links, konsistente Formatierung
✅ Automatisierung: Ein Befehl für vollständige Synchronisierung

Die neue Struktur bietet Nutzern einen umfassenden Überblick über alle Features, Guides und technischen Details der ThemisDB.

Erstellt: 2025-11-30
Autor: GitHub Copilot (Claude Sonnet 4.5)
Projekt: ThemisDB Documentation Overhaul

themis docs exporters JSONL_LLM_EXPORTER

JSONL LLM Exporter - LoRA/QLoRA Training Data Export

Overview

Key Features

Installation

As Plugin

Direct Usage

Configuration

Instruction Tuning Format

Chat Completion Format

Text Completion Format

Weighting Strategies

1. Explicit Weights

2. Auto-Weight by Length

3. Auto-Weight by Freshness

4. Combined Strategies

Quality Filtering

Length Constraints

Empty Output Detection

Duplicate Detection

Metadata Enrichment

Usage Examples

Example 1: Export FAQ Database for LoRA Training

Example 2: Export Chat Logs for QLoRA

Example 3: Export Knowledge Base with Freshness Weighting

Training with Exported Data

LoRA Training (HuggingFace PEFT)

QLoRA Training (bitsandbytes)

Output Statistics

Limitations

Planned Enhancements (v2.0)

See Also

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!