Architecture Overview

High-level system architecture and design of the RAG system

The RAG system follows a layered architecture with clear separation of concerns between the API surface, core business logic, service integrations, and storage backends.

System Architecture Diagram

Layered Architecture

Layer 1 — API Layer

The FastAPI server exposes all system operations through REST endpoints. It handles:

  • Authentication — API key validation via X-API-Key header
  • Request validation — Pydantic schema enforcement
  • Dependency injection — Singleton instances of core components
  • Background tasks — ThreadPoolExecutor-based async processing
  • CORS — Configurable cross-origin resource sharing

Layer 2 — Core Layer

The central business logic layer containing three main orchestrators:

| Component | Role |
| --- | --- |
| RAGSystem | Coordinates query → retrieval → LLM generation pipeline |
| DocumentManager | Manages document CRUD (add, upsert, delete) across both stores |
| SemanticRetriever | Performs hybrid search with vector + BM25 + reranking |
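The coordination role of RAGSystem can be sketched as follows. The class shape, method names, and stand-in components below are illustrative assumptions, not the project's actual signatures:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Chunk:
    """Minimal stand-in for a retrieved document chunk."""
    text: str
    score: float

class RAGSystem:
    """Coordinates the query -> retrieval -> generation pipeline."""

    def __init__(self, retriever, llm):
        self.retriever = retriever  # SemanticRetriever in the real system
        self.llm = llm              # LLM client in the real system

    async def answer(self, query: str, top_k: int = 5) -> str:
        # 1. Hybrid retrieval (vector + BM25 + reranking happen behind this call)
        chunks = await self.retriever.search(query, top_k=top_k)
        # 2. Assemble retrieved text into a context block
        context = "\n\n".join(chunk.text for chunk in chunks)
        # 3. Generate the answer grounded in that context
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
        return await self.llm.generate(prompt)
```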

Layer 3 — Service Layer

Standalone services that each handle a specific concern:

| Service | Purpose | External API |
| --- | --- | --- |
| Embedder | Generate vector embeddings | Ollama /api/embed |
| Reranker | Cross-encoder relevance scoring | TEI /rerank |
| LLM Client | Answer generation | Ollama /api/chat |
| BM25 Index | Sparse keyword search | In-memory |

Layer 4 — Storage Layer

Dual-storage architecture for different data needs:

| Store | Data | Use Case |
| --- | --- | --- |
| ChromaDB | Vector embeddings + document text | Semantic similarity search |
| PostgreSQL | Source metadata, duplicate logs | CRUD operations, dedup tracking |

Layer 5 — External Services

All external service calls use async HTTP clients (httpx.AsyncClient) for non-blocking I/O:

| Service | Protocol | Purpose |
| --- | --- | --- |
| Ollama | HTTP REST | Embeddings (bge-m3) and LLM (gemma3) |
| HuggingFace TEI | HTTP REST | Cross-encoder reranking (bge-reranker-v2-m3) |

Component Architecture Diagram

Key Design Decisions

Hybrid Search with RRF Fusion

The retrieval pipeline combines vector search (semantic similarity) and BM25 (keyword matching) using Reciprocal Rank Fusion (RRF). This provides better recall than either approach alone — semantic search handles paraphrases and meaning, while BM25 catches exact term matches.
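A minimal sketch of the fusion step. The constant `k = 60` is the conventional default from the RRF literature; whether this system uses that exact value is an assumption:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum of 1 / (k + rank) over lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

Because scores add across lists, a document ranked moderately well by both vector search and BM25 outranks one that appears near the top of only a single list.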

Async-First Architecture

All external HTTP calls use httpx.AsyncClient for non-blocking I/O. The embedding and BM25 search steps run in parallel via asyncio.gather(), while CPU-bound operations (ChromaDB queries, BM25 indexing) are offloaded to a thread pool.
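The parallel step can be sketched like this, with hypothetical stand-ins for the real embedder and BM25 index:

```python
import asyncio

async def embed_query(query: str) -> list[float]:
    """Stand-in for the async httpx call to Ollama."""
    await asyncio.sleep(0.01)
    return [0.1, 0.2, 0.3, 0.4]

def bm25_search(query: str) -> list[str]:
    """Stand-in for CPU-bound BM25 scoring."""
    return ["doc-1", "doc-3"]

async def hybrid_retrieve(query: str):
    # The async embedding call and the CPU-bound BM25 step run concurrently;
    # BM25 is offloaded to a worker thread so it never blocks the event loop.
    embedding, bm25_hits = await asyncio.gather(
        embed_query(query),
        asyncio.to_thread(bm25_search, query),
    )
    return embedding, bm25_hits
```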

Dual Storage

ChromaDB stores vector embeddings for fast similarity search, while PostgreSQL tracks document metadata, source provenance, and duplicate detection logs. This separation allows each store to be optimized for its specific access pattern.

Stateless API with Background Tasks

The API server is stateless — all state lives in ChromaDB and PostgreSQL. Background tasks (async scraping, async uploads) use a ThreadPoolExecutor with in-memory tracking. This keeps the deployment simple while supporting long-running operations.
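The background-task pattern can be sketched as follows; the executor size, function names, and status map are assumed names for illustration:

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)
task_status: dict[str, str] = {}  # in-memory tracking, as described above

def submit_background(fn, *args) -> str:
    """Run fn(*args) on the pool and return a task id the client can poll."""
    task_id = str(uuid.uuid4())
    task_status[task_id] = "running"

    def _run():
        try:
            fn(*args)
            task_status[task_id] = "done"
        except Exception:
            task_status[task_id] = "failed"

    executor.submit(_run)
    return task_id
```

The endpoint returns the task id immediately; a status route can then report `task_status[task_id]` while the scrape or upload runs to completion on the pool.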
