# Architecture Overview

High-level system architecture and design of the RAG system.
The RAG system follows a layered architecture with clear separation of concerns between the API surface, core business logic, service integrations, and storage backends.
## System Architecture Diagram

## Layered Architecture

### Layer 1 — API Layer
The FastAPI server exposes all system operations through REST endpoints. It handles:
- Authentication — API key validation via the X-API-Key header
- Request validation — Pydantic schema enforcement
- Dependency injection — Singleton instances of core components
- Background tasks — ThreadPoolExecutor-based async processing
- CORS — Configurable cross-origin resource sharing
### Layer 2 — Core Layer
The central business logic layer containing three main orchestrators:
| Component | Role |
|---|---|
| RAGSystem | Coordinates query → retrieval → LLM generation pipeline |
| DocumentManager | Manages document CRUD (add, upsert, delete) across both stores |
| SemanticRetriever | Performs hybrid search with vector + BM25 + reranking |
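The coordination RAGSystem performs can be sketched as two awaited stages. The function names and return shapes below are illustrative stand-ins, not the actual class API:

```python
import asyncio

async def retrieve_context(query: str) -> list[str]:
    # Stand-in for SemanticRetriever's hybrid search (vector + BM25 + rerank).
    return ["chunk about layered architecture"]

async def generate_answer(query: str, context: list[str]) -> str:
    # Stand-in for the LLM client's Ollama /api/chat call.
    return f"Answer based on {len(context)} chunk(s)."

async def answer_query(query: str) -> str:
    # Query -> retrieval -> generation, each stage awaited in turn.
    context = await retrieve_context(query)
    return await generate_answer(query, context)
```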
### Layer 3 — Service Layer
Standalone services that each handle a specific concern:
| Service | Purpose | External API |
|---|---|---|
| Embedder | Generate vector embeddings | Ollama /api/embed |
| Reranker | Cross-encoder relevance scoring | TEI /rerank |
| LLM Client | Answer generation | Ollama /api/chat |
| BM25 Index | Sparse keyword search | In-memory |
### Layer 4 — Storage Layer
Dual-storage architecture for different data needs:
| Store | Data | Use Case |
|---|---|---|
| ChromaDB | Vector embeddings + document text | Semantic similarity search |
| PostgreSQL | Source metadata, duplicate logs | CRUD operations, dedup tracking |
### Layer 5 — External Services
All external service calls use async HTTP clients (httpx.AsyncClient) for non-blocking I/O:
| Service | Protocol | Purpose |
|---|---|---|
| Ollama | HTTP REST | Embeddings (bge-m3) and LLM (gemma3) |
| HuggingFace TEI | HTTP REST | Cross-encoder reranking (bge-reranker-v2-m3) |
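As a sketch of one such call: this assumes Ollama's /api/embed endpoint accepts a JSON body with model and input fields and returns an embeddings array, per the table above. The base URL, timeout, and helper name are illustrative:

```python
def embed_request(model: str, texts: list[str]) -> dict:
    # Request body for Ollama's /api/embed endpoint.
    return {"model": model, "input": texts}

async def embed(texts: list[str], base_url: str = "http://localhost:11434") -> list[list[float]]:
    import httpx  # async HTTP client used throughout the service layer

    async with httpx.AsyncClient(base_url=base_url, timeout=30.0) as client:
        resp = await client.post("/api/embed", json=embed_request("bge-m3", texts))
        resp.raise_for_status()
        return resp.json()["embeddings"]
```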
## Component Architecture Diagram

## Key Design Decisions
### Hybrid Search with RRF Fusion
The retrieval pipeline combines vector search (semantic similarity) and BM25 (keyword matching) using Reciprocal Rank Fusion (RRF). This provides better recall than either approach alone — semantic search handles paraphrases and meaning, while BM25 catches exact term matches.
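A minimal RRF implementation, assuming each retriever returns a ranked list of document IDs; the constant k=60 follows the original RRF formulation:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document scores sum(1 / (k + rank)) over every list it appears in,
    # so agreement between retrievers compounds while noise from one list fades.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: a doc ranked well by both lists outranks single-list hits.
fused = rrf_fuse([["d3", "d1", "d2"], ["d3", "d4", "d1"]])
# fused == ["d3", "d1", "d4", "d2"]
```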
### Async-First Architecture
All external HTTP calls use httpx.AsyncClient for non-blocking I/O. The embedding and BM25 search steps run in parallel via asyncio.gather(), while CPU-bound operations (ChromaDB queries, BM25 indexing) are offloaded to a thread pool.
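The parallel step can be sketched like this, with stub functions standing in for the real embedder and BM25 index:

```python
import asyncio

async def embed_query(query: str) -> list[float]:
    # Placeholder for the non-blocking Ollama embedding call.
    await asyncio.sleep(0.01)
    return [0.1, 0.2]

def bm25_search(query: str) -> list[str]:
    # CPU-bound sparse keyword search; offloaded to a thread pool below.
    return ["doc-1", "doc-7"]

async def retrieve(query: str):
    loop = asyncio.get_running_loop()
    # Run the embedding call and BM25 search concurrently.
    embedding, bm25_hits = await asyncio.gather(
        embed_query(query),
        loop.run_in_executor(None, bm25_search, query),
    )
    return embedding, bm25_hits
```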
### Dual Storage
ChromaDB stores vector embeddings for fast similarity search, while PostgreSQL tracks document metadata, source provenance, and duplicate detection logs. This separation allows each store to be optimized for its specific access pattern.
### Stateless API with Background Tasks
The API server is stateless — all state lives in ChromaDB and PostgreSQL. Background tasks (async scraping, async uploads) use a ThreadPoolExecutor with in-memory tracking. This keeps the deployment simple while supporting long-running operations.
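A sketch of that pattern, with a stub scrape job and a plain dict for the in-memory status tracking; the function and status names are illustrative:

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)
tasks: dict[str, str] = {}  # in-memory task-id -> status map

def scrape(url: str) -> None:
    # Stand-in for a long-running scrape or upload job.
    pass

def _mark_finished(task_id: str):
    # Update the status map once the worker thread completes.
    def callback(future):
        tasks[task_id] = "failed" if future.exception() else "done"
    return callback

def submit_scrape(url: str) -> str:
    # Return a task id immediately; the job runs in the background.
    task_id = str(uuid.uuid4())
    tasks[task_id] = "running"
    future = executor.submit(scrape, url)
    future.add_done_callback(_mark_finished(task_id))
    return task_id
```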