# Architecture Overview

High-level system architecture and design of the RAG system.
The RAG system follows a layered architecture with clear separation of concerns between the API surface, core business logic, service integrations, and storage backends.
## System Architecture Diagram

## Layered Architecture

### Layer 1 — API Layer
The FastAPI server exposes all system operations through REST endpoints. It handles:
- Authentication — API key validation via the X-API-Key header
- Request validation — Pydantic schema enforcement
- Dependency injection — Singleton instances of core components
- Background tasks — ThreadPoolExecutor-based async processing
- CORS — Configurable cross-origin resource sharing
### Layer 2 — Core Layer
The central business logic layer containing three main orchestrators:
| Component | Role |
|---|---|
| RAGSystem | Coordinates query → retrieval → LLM generation pipeline |
| DocumentManager | Manages document CRUD (add, upsert, delete) across both stores |
| SemanticRetriever | Performs hybrid search with vector + BM25 + reranking |
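The coordination RAGSystem performs can be sketched as two awaited stages. The function names and return shapes below are illustrative stand-ins, not the actual class API:

```python
import asyncio

async def retrieve_context(query: str) -> list[str]:
    # Stand-in for SemanticRetriever's hybrid search (vector + BM25 + rerank).
    return ["chunk about layered architecture"]

async def generate_answer(query: str, context: list[str]) -> str:
    # Stand-in for the LLM client's Ollama /api/chat call.
    return f"Answer based on {len(context)} chunk(s)."

async def answer_query(query: str) -> str:
    # Query -> retrieval -> generation, each stage awaited in turn.
    context = await retrieve_context(query)
    return await generate_answer(query, context)
```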
### Layer 3 — Service Layer
Standalone services that each handle a specific concern:
| Service | Purpose | External API |
|---|---|---|
| Embedder | Generate vector embeddings | Ollama /api/embed |
| Reranker | Cross-encoder relevance scoring | TEI /rerank |
| LLM Client | Answer generation | Ollama /api/chat |
| BM25 Index | Sparse keyword search | In-memory |
### Layer 4 — Storage Layer
Dual-storage architecture for different data needs:
| Store | Data | Use Case |
|---|---|---|
| ChromaDB | Vector embeddings + document text | Semantic similarity search |
| PostgreSQL | Source metadata, duplicate logs | CRUD operations, dedup tracking |
### Layer 5 — External Services
All external service calls use async HTTP clients (httpx.AsyncClient) for non-blocking I/O:
| Service | Protocol | Purpose |
|---|---|---|
| Ollama | HTTP REST | Embeddings (bge-m3) and LLM (gemma3) |
| HuggingFace TEI | HTTP REST | Cross-encoder reranking (bge-reranker-v2-m3) |
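As a sketch of one such call: this assumes Ollama's /api/embed endpoint accepts a JSON body with model and input fields and returns an embeddings array, per the table above. The base URL, timeout, and helper name are illustrative:

```python
def embed_request(model: str, texts: list[str]) -> dict:
    # Request body for Ollama's /api/embed endpoint.
    return {"model": model, "input": texts}

async def embed(texts: list[str], base_url: str = "http://localhost:11434") -> list[list[float]]:
    import httpx  # async HTTP client used throughout the service layer

    async with httpx.AsyncClient(base_url=base_url, timeout=30.0) as client:
        resp = await client.post("/api/embed", json=embed_request("bge-m3", texts))
        resp.raise_for_status()
        return resp.json()["embeddings"]
```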
## Component Architecture Diagram

## Key Design Decisions
### Hybrid Search with RRF Fusion
The retrieval pipeline combines vector search (semantic similarity) and BM25 (keyword matching) using Reciprocal Rank Fusion (RRF). This provides better recall than either approach alone — semantic search handles paraphrases and meaning, while BM25 catches exact term matches.
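A minimal RRF implementation, assuming each retriever returns a ranked list of document IDs; the constant k=60 follows the original RRF formulation:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document scores sum(1 / (k + rank)) over every list it appears in,
    # so agreement between retrievers compounds while noise from one list fades.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: a doc ranked well by both lists outranks single-list hits.
fused = rrf_fuse([["d3", "d1", "d2"], ["d3", "d4", "d1"]])
# fused == ["d3", "d1", "d4", "d2"]
```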
### Async-First Architecture
All external HTTP calls use httpx.AsyncClient for non-blocking I/O. The embedding and BM25 search steps run in parallel via asyncio.gather(), while CPU-bound operations (ChromaDB queries, BM25 indexing) are offloaded to a thread pool.
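The parallel step can be sketched like this, with stub functions standing in for the real embedder and BM25 index:

```python
import asyncio

async def embed_query(query: str) -> list[float]:
    # Placeholder for the non-blocking Ollama embedding call.
    await asyncio.sleep(0.01)
    return [0.1, 0.2]

def bm25_search(query: str) -> list[str]:
    # CPU-bound sparse keyword search; offloaded to a thread pool below.
    return ["doc-1", "doc-7"]

async def retrieve(query: str):
    loop = asyncio.get_running_loop()
    # Run the embedding call and BM25 search concurrently.
    embedding, bm25_hits = await asyncio.gather(
        embed_query(query),
        loop.run_in_executor(None, bm25_search, query),
    )
    return embedding, bm25_hits
```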
### Dual Storage
ChromaDB stores vector embeddings for fast similarity search, while PostgreSQL tracks document metadata, source provenance, and duplicate detection logs. This separation allows each store to be optimized for its specific access pattern.
### Stateless API with Background Tasks
The API server is stateless — all state lives in ChromaDB and PostgreSQL. Background tasks (async scraping, async uploads) use a ThreadPoolExecutor with in-memory tracking. This keeps the deployment simple while supporting long-running operations.
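A sketch of that pattern, with a stub scrape job and a plain dict for the in-memory status tracking; the function and status names are illustrative:

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)
tasks: dict[str, str] = {}  # in-memory task-id -> status map

def scrape(url: str) -> None:
    # Stand-in for a long-running scrape or upload job.
    pass

def _mark_finished(task_id: str):
    # Update the status map once the worker thread completes.
    def callback(future):
        tasks[task_id] = "failed" if future.exception() else "done"
    return callback

def submit_scrape(url: str) -> str:
    # Return a task id immediately; the job runs in the background.
    task_id = str(uuid.uuid4())
    tasks[task_id] = "running"
    future = executor.submit(scrape, url)
    future.add_done_callback(_mark_finished(task_id))
    return task_id
```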