Demo and starter kit for a RAG application using OpenAI's API and Hono.js
Enhanced Version Available! This demo now includes an enhanced RAG implementation with intelligent document chunking and hybrid search. Run `pnpm dev:enhanced` to try it out.
This application demonstrates a complete Retrieval-Augmented Generation (RAG) pipeline that allows users to ask questions about custom datasets. Here's how it works:
The demo implements semantic search using OpenAI's embedding and completion APIs, following the principles outlined in OpenAI's Retrieval Guide.
RAG Pipeline:
- Document Embedding: All documents in a dataset are converted to embeddings using OpenAI's `text-embedding-ada-002` model
- Query Processing: User questions are embedded using the same model
- Semantic Search: The system finds the most relevant documents using cosine similarity between embeddings
- Context Injection: Relevant documents are injected into a prompt template as context
- AI Generation: OpenAI's `gpt-3.5-turbo` generates responses based on the context and question
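To make the semantic-search step concrete, here is a minimal sketch of cosine-similarity ranking (illustrative code, not the project's actual implementation):

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank documents by similarity to the query embedding and keep the top k.
function topKDocuments(
  queryEmbedding: number[],
  docs: { id: string; text: string; embedding: number[] }[],
  k = 3,
) {
  return docs
    .map((doc) => ({ doc, score: cosineSimilarity(queryEmbedding, doc.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```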
Key AI Components:
- Embeddings: Convert text to vector representations for semantic similarity
- Embedding Cache: Automatically cache embeddings by content hash to avoid redundant API calls
- Semantic Search: Find relevant documents without exact keyword matching
- Prompt Engineering: System prompts guide the AI's behavior and response style
- Context Window Management: Only the most relevant documents are included to stay within token limits
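The context-injection step can be sketched like this, assuming the `{{context}}`/`{{question}}` template format described in the dataset layout below (the helper name is illustrative):

```typescript
// Fill the user template with retrieved context and the user's question.
// The {{context}} and {{question}} placeholders match the user-template.md format.
function buildUserMessage(
  template: string,
  relevantDocs: { text: string }[],
  question: string,
): string {
  const context = relevantDocs.map((d) => d.text).join("\n\n---\n\n");
  return template
    .replace("{{context}}", context)
    .replace("{{question}}", question);
}
```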
Each dataset is a folder in the `data/` directory containing three markdown files (plus an auto-generated `embeddings/` cache):
```
data/
├── example-fruits/
│   ├── docs.md            # Documents separated by ***
│   ├── system-prompt.md   # AI behavior instructions
│   ├── user-template.md   # Message template with {{context}} and {{question}} placeholders
│   └── embeddings/        # Cached embeddings (auto-generated)
├── example-nodejs/
│   ├── docs.md
│   ├── system-prompt.md
│   ├── user-template.md
│   └── embeddings/        # Cached embeddings (auto-generated)
└── ...
```
Document Format (`docs.md`):
Documents are stored in markdown format, separated by `***` lines. YAML frontmatter is automatically stripped.
```
---
description: "Example dataset"
---

First document content here.

***

Second document with more information.

***

Third document about something else.
```
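A minimal sketch of parsing such a file (assuming only the `***` separators and optional frontmatter described above; not the project's actual loader):

```typescript
// Split docs.md into individual documents, stripping optional YAML frontmatter.
function parseDocs(raw: string): string[] {
  // Remove a leading frontmatter block delimited by --- lines, if present.
  const withoutFrontmatter = raw.replace(/^---\n[\s\S]*?\n---\n/, "");
  return withoutFrontmatter
    .split(/^\*\*\*$/m)              // documents are separated by *** lines
    .map((doc) => doc.trim())
    .filter((doc) => doc.length > 0);
}
```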
Embedding Cache (`embeddings/`):
The system automatically caches embeddings to avoid redundant API calls and improve performance. Embeddings are organized by provider and model in subfolders: `embeddings/{provider}/{model}/`. Each document's embedding is stored as a JSON file named by the SHA256 hash of its content:
- Provider Isolation: Different AI providers and models use separate cache directories
- Cache Invalidation: When document content changes, the hash changes, automatically invalidating old embeddings
- Performance: Subsequent runs load embeddings from cache instead of calling the API
- Model Support: Works with OpenAI, LM Studio, and other embedding providers
- Storage Format: Each cache file contains the embedding vector, original text, content hash, and metadata
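A minimal sketch of the content-hash lookup, assuming Node's built-in `crypto` and `fs` modules (the path scheme matches the layout above; function names are illustrative):

```typescript
import { createHash } from "node:crypto";
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Resolve the cache path for a document's embedding:
// embeddings/{provider}/{model}/{sha256-of-content}.json
function cachePathFor(baseDir: string, provider: string, model: string, text: string): string {
  const hash = createHash("sha256").update(text).digest("hex");
  return join(baseDir, "embeddings", provider, model, `${hash}.json`);
}

// Load a cached embedding if present; callers fall back to the API on a miss.
function loadCachedEmbedding(path: string): number[] | null {
  if (!existsSync(path)) return null;
  const cached = JSON.parse(readFileSync(path, "utf8"));
  return cached.embedding; // file also stores the original text, hash, and metadata
}
```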
The project includes convenient scripts for managing cached embeddings:
Generate embeddings for a dataset:

```bash
pnpm embeddings:generate example-fruits
```

Creates embedding cache files for all documents in the specified dataset. Requires the `OPENAI_API_KEY` environment variable.

Clean unused cache files:

```bash
pnpm embeddings:clean example-fruits
```

Removes cache files that no longer correspond to current document content (useful after editing documents).

Update embeddings (generate + clean):

```bash
pnpm embeddings:update example-fruits
```

Runs both generate and clean operations in sequence; this is the recommended way to refresh embeddings after making changes.
The data loading system is modular and extensible. The core types are:
````typescript
/**
 * Abstract base class for loading documents from various sources.
 * Implement this class to support different data sources like JSON files,
 * databases, APIs, etc.
 *
 * @example
 * ```typescript
 * class DatabaseDocumentLoader extends DocumentLoader {
 *   async loadDocuments(dataSet: string): Promise<Doc[]> {
 *     const rows = await db.query('SELECT * FROM docs WHERE dataset = ?', [dataSet]);
 *     return rows.map(row => ({ id: row.id, text: row.content }));
 *   }
 * }
 * ```
 */
export abstract class DocumentLoader {
  /**
   * Load documents for a given dataset.
   * @param dataSet - The name/identifier of the dataset to load
   * @returns Promise resolving to an array of documents
   */
  abstract loadDocuments(dataSet: string): Promise<Doc[]>;
}
````

To add your own dataset:
- Create a new folder in `data/` with your dataset name
- Add the three required markdown files
- The system will automatically discover and load your dataset

To customize data loading:

- Extend the `DocumentLoader` class in `src/dataset/DocumentLoader.ts`
- Implement your custom `loadDocuments()` method (see the sketch below)
- Update the semantic search to use your custom loader
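For example, a hypothetical loader that reads a dataset from a JSON file might look like this (a sketch under assumed file paths and types, not part of the project):

```typescript
import { readFile } from "node:fs/promises";
// Import path and Doc location are assumptions based on src/dataset/DocumentLoader.ts.
import { DocumentLoader, Doc } from "./DocumentLoader";

// Loads documents from data/{dataSet}/docs.json, assumed to contain
// an array of { id, text } objects. Purely illustrative.
export class JsonDocumentLoader extends DocumentLoader {
  async loadDocuments(dataSet: string): Promise<Doc[]> {
    const raw = await readFile(`data/${dataSet}/docs.json`, "utf8");
    return JSON.parse(raw) as Doc[];
  }
}
```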
The application uses Hono.js as a lightweight web framework:
- Route Handling: Simple GET/POST routes for the web interface
- Form Processing: Native HTML forms submit questions via POST
- Server-Side Rendering: HTML is generated server-side with embedded results
- Static UI: No JavaScript frontend - pure HTML forms and server responses
Request Flow:
- User visits `/` and selects a dataset
- User submits question via HTML form
- Server processes question through RAG pipeline
- Server renders results page with answer and debug information
- Debug section shows the complete request/response data and context used
The UI is intentionally minimal to focus on the RAG functionality rather than frontend complexity.
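In Hono terms, the request flow boils down to two routes, roughly like this (a simplified sketch; the handlers and HTML are placeholders, not the project's actual code):

```typescript
import { Hono } from "hono";

const app = new Hono();

// Illustrative pipeline stub; the real app embeds the question,
// retrieves relevant documents, and calls the completion API.
async function answerQuestion(dataset: string, question: string): Promise<string> {
  return `(${dataset}) answer to: ${question}`;
}

// GET /: render the dataset picker and question form as plain HTML.
app.get("/", (c) =>
  c.html(
    `<form method="post">
       <input name="dataset" value="example-fruits" />
       <input name="question" />
       <button type="submit">Ask</button>
     </form>`,
  ),
);

// POST /: run the question through the RAG pipeline and render the answer.
app.post("/", async (c) => {
  const body = await c.req.parseBody();
  const answer = await answerQuestion(String(body["dataset"]), String(body["question"]));
  return c.html(`<p>${answer}</p>`);
});

export default app;
```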
The enhanced version (`pnpm dev:enhanced`) includes significant improvements for better retrieval quality.

Intelligent Document Chunking (a sketch follows the list):
- Automatic Chunking: Large documents are automatically split into ~500 token chunks
- Sentence Preservation: Chunks respect sentence boundaries for better readability
- Overlapping Context: Chunks include overlapping content to preserve context
- Smart Threshold: Only documents >1.5x chunk size are split (avoids unnecessary chunking)
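A simplified version of sentence-aware chunking with overlap might look like the following (characters approximate tokens here; the actual implementation may differ):

```typescript
// Split long text into ~maxLen-character chunks on sentence boundaries,
// carrying the last sentence forward so adjacent chunks overlap.
// (~500 tokens is roughly 2000 characters.)
function chunkText(text: string, maxLen = 2000): string[] {
  // Mirror the smart threshold: only split documents well above the chunk size.
  if (text.length <= maxLen * 1.5) return [text];

  const sentences = text.match(/[^.!?]+[.!?]+\s*|[^.!?]+$/g) ?? [text];
  const chunks: string[] = [];
  let current: string[] = [];
  let length = 0;

  for (const sentence of sentences) {
    if (length + sentence.length > maxLen && current.length > 0) {
      chunks.push(current.join("").trim());
      const carry = current[current.length - 1]; // overlap with the previous chunk
      current = [carry];
      length = carry.length;
    }
    current.push(sentence);
    length += sentence.length;
  }
  if (current.length > 1 || chunks.length === 0) {
    chunks.push(current.join("").trim());
  }
  return chunks;
}
```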
Hybrid Search (a sketch follows the list):
- Dual Scoring: Combines embedding similarity with keyword matching
- Better Precision: Excellent for finding specific terms, error codes, or technical details
- Configurable Weights: Adjust the balance between semantic and keyword search
- Keyword Highlighting: Shows matching keywords in search results
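The dual-scoring idea can be sketched as a weighted blend (the weight and keyword scoring below are illustrative defaults, not the project's exact formula):

```typescript
// Blend semantic and keyword scores; `weight` shifts the balance
// between embedding similarity and exact keyword matches.
function hybridScore(
  semanticScore: number,   // cosine similarity, assumed normalized to [0, 1]
  queryTerms: string[],
  docText: string,
  weight = 0.7,            // illustrative default
): number {
  const text = docText.toLowerCase();
  const matched = queryTerms.filter((t) => text.includes(t.toLowerCase())).length;
  const keywordScore = queryTerms.length ? matched / queryTerms.length : 0;
  return weight * semanticScore + (1 - weight) * keywordScore;
}
```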
Enhanced Results Display:
- Chunk Metadata: Shows which document and chunk each result comes from
- Score Transparency: Displays relevance scores for each result
- Highlighted Excerpts: Shows keyword matches in context
- Configurable Results: Choose how many results to retrieve (1-10)
Query: "What is the API timeout?"
Basic RAG:
- Returns entire 2000-word API documentation section
- User must scan through to find timeout information
Enhanced RAG:
- Returns specific chunk: "The API timeout is set to 30 seconds by default..."
- Highlights the matching keywords
- Shows relevance score and chunk location
```bash
# Development mode with hot-reloading
pnpm dev:enhanced

# Production mode
pnpm start:enhanced
```

The enhanced server runs on the same port (8787) and is fully backwards compatible with existing datasets.
- Semantic search
- Retrieval-Augmented Generation (RAG)
- OpenAI embeddings & completions
- Hono.js
LM Studio setup: download LM Studio from https://lmstudio.ai and install the following models:

- https://huggingface.co/NathanMad/sentence-transformers_all-MiniLM-L12-v2-gguf
- https://huggingface.co/NathanMad/llama-2-13b-chat-gguf
```bash
cp .env_example_lmstudio .env
pnpm install
pnpm run dev
```

OpenAI setup:

- Install dependencies:

  ```bash
  pnpm install
  ```

- Set your OpenAI API key:

  ```bash
  cp .env_example_openai .env
  ```

  Then edit `.env` and set your OpenAI API key: `OPENAI_API_KEY=sk-...`

- Run the server with watching and hot-reloading:

  ```bash
  pnpm run dev
  ```

- Open http://localhost:8787 in your browser.
- You’ll see a form to submit a question (all UI is native HTML, no styling).
- After submitting a question, you’ll see the answer and the most relevant example passages from the in-memory dataset.
- `pnpm dev` - start development server (basic RAG)
- `pnpm dev:enhanced` - start enhanced development server (with chunking and hybrid search)
- `pnpm start` - run production server (basic RAG)
- `pnpm start:enhanced` - run production enhanced server

- `pnpm type-check` - run TypeScript compiler in noEmit mode
- `pnpm test` - run Jest tests
- `pnpm test:e2e` - run Playwright end-to-end tests with mocked OpenAI responses
- `pnpm ci` - run all checks (type-check, tests, e2e)

- `pnpm embeddings:generate <dataset>` - generate cached embeddings for a dataset
- `pnpm embeddings:clean <dataset>` - remove unused cached embeddings for a dataset
- `pnpm embeddings:update <dataset>` - generate new embeddings and clean unused ones

- `pnpm cache:list` - list all cached embeddings organized by provider and model
- `pnpm cache:clear <dataset> [provider] [model]` - clear specific or all caches
- `pnpm cache:setup <dataset>` - create directories for popular embedding models
Playwright is used for E2E testing with a mocked OpenAI API. Tests start the server automatically.
Run E2E tests:

```bash
pnpm test:e2e
```
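An individual test might look roughly like this (selectors and expected text are assumptions, not the suite's actual assertions):

```typescript
import { test, expect } from "@playwright/test";

test("answers a question via the HTML form", async ({ page }) => {
  await page.goto("http://localhost:8787/");
  // Field names are assumptions based on the form described above.
  await page.fill('input[name="question"]', "What fruits are in the dataset?");
  await page.click('button[type="submit"]');
  // With the OpenAI API mocked, the rendered page should contain an answer section.
  await expect(page.locator("body")).toContainText("Answer");
});
```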