
System Components

Detailed breakdown of every component in the RAG system

This page documents every component in the RAG system, organized by layer.

API Components

FastAPI App (api/main.py)

The application entry point. Sets up:

  • Lifespan handler — initializes and shuts down core services
  • CORS middleware — configurable cross-origin access
  • Exception handlers — consistent JSON error responses
  • Router registration — mounts all endpoint groups under /api/v1

Auth Middleware (api/auth.py)

Validates X-API-Key header on all protected routes.

Exempt paths: /, /docs, /redoc, /openapi.json, /api/v1/health/*

When API_KEYS is empty in .env, authentication is disabled for development.
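
The exempt-path and key checks can be sketched as plain functions (hypothetical helpers, not the actual api/auth.py implementation):

```python
from fnmatch import fnmatch

EXEMPT_PATHS = ["/", "/docs", "/redoc", "/openapi.json", "/api/v1/health/*"]

def is_exempt(path: str) -> bool:
    # Exempt paths bypass the API-key check; the last pattern is a glob.
    return any(fnmatch(path, pattern) for pattern in EXEMPT_PATHS)

def is_authorized(headers: dict, api_keys: set[str]) -> bool:
    # Empty API_KEYS disables authentication (development mode).
    if not api_keys:
        return True
    return headers.get("X-API-Key") in api_keys
```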

Dependencies (api/dependencies.py)

Provides singleton instances via FastAPI's dependency injection:

Dependency           Returns
get_rag_system()     RAGSystem — main orchestrator
get_scraper()        WebScraper — URL scraping
get_uploader()       FileUploader — file processing
get_task_manager()   TaskManager — background tasks
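
One common way to provide singletons is to cache the factory; this sketch assumes that pattern (the real factories may construct their instances differently):

```python
from functools import lru_cache

class RAGSystem:
    """Stand-in for the real class in src/rag.py."""

@lru_cache(maxsize=1)
def get_rag_system() -> RAGSystem:
    # First call constructs the instance; later calls return the same object.
    return RAGSystem()
```

FastAPI then injects the shared instance into route handlers with Depends(get_rag_system).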

Task Manager (api/services/task_manager.py)

Manages background tasks using ThreadPoolExecutor:

  • Creates tasks with unique UUIDs
  • Tracks status: pending → running → completed / failed / cancelled
  • Reports progress for long-running operations
  • Future: database persistence for task state
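
A minimal version of this pattern, assuming the status names above (it omits the progress reporting and cancellation the real TaskManager provides):

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

class TaskManager:
    def __init__(self, workers: int = 4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._status: dict[str, str] = {}

    def submit(self, fn, *args) -> str:
        # Each task gets a unique UUID so clients can poll its status.
        task_id = str(uuid.uuid4())
        self._status[task_id] = "pending"

        def run():
            self._status[task_id] = "running"
            try:
                fn(*args)
                self._status[task_id] = "completed"
            except Exception:
                self._status[task_id] = "failed"

        self._pool.submit(run)
        return task_id

    def status(self, task_id: str) -> str:
        return self._status[task_id]
```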

Core Components

RAGSystem (src/rag.py)

The main orchestrator that coordinates the full RAG pipeline. Provides both sync and async APIs.

Key methods:

Method                                    Description
query(question) / query_async(question)   Full RAG: retrieve → rerank → generate answer
search(query) / search_async(query)       Semantic search only (no LLM)
add_documents(docs)                       Add documents to the vector store
upsert_documents(docs)                    Smart update — only re-embeds changed content
delete_document(id)                       Delete a single document
delete_by_source(source)                  Delete all documents from a source
health_check()                            Verify all service connections
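
The retrieve → rerank → generate flow behind query() can be sketched as follows (the component stubs are hypothetical; the real method wires in SemanticRetriever, Reranker, and the LLM client):

```python
def query(question, retriever, reranker, llm, top_k=5):
    candidates = retriever(question)          # retrieve candidate chunks
    ranked = reranker(question, candidates)   # rerank by relevance
    context = "\n\n".join(ranked[:top_k])     # keep the best chunks
    return llm(question, context)             # generate the answer
```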

DocumentManager (src/document_manager.py)

Wraps ChromaDB and PostgreSQL operations into a unified CRUD interface:

  • Add — embed and store new documents
  • Upsert — detect changes via content hash, only re-embed if changed
  • Delete — remove from both ChromaDB and PostgreSQL
  • Query — list, filter, and paginate documents

SemanticRetriever (src/retriever.py)

Performs hybrid search combining multiple retrieval strategies:

  1. Embed the query via Ollama
  2. Vector search — cosine similarity in ChromaDB (top 20)
  3. BM25 search — TF-IDF keyword matching (top 20)
  4. RRF fusion — merge and rank results from both searches
  5. Rerank — cross-encoder scoring via TEI (return top 5)
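
Step 4 uses Reciprocal Rank Fusion; a standard formulation over the two ranked ID lists looks like this (k = 60 is the conventional RRF constant, assumed here):

```python
def rrf_fuse(vector_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    # Each list contributes 1 / (k + rank) per document; sums decide the order.
    scores: dict[str, float] = {}
    for ranking in (vector_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both result lists accumulate score from each, which is why they rise to the top even without score normalization across the two searches.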

Processing Pipeline Components

RAGPreprocessor (src/pipeline.py)

Orchestrates the full document processing pipeline:

Raw Text → Clean → Chunk → Deduplicate → Detect Language → Output
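
As function composition, the stages above look roughly like this (the stage functions are placeholders; the real pipeline passes richer chunk objects between stages):

```python
def preprocess(raw_text, clean, chunk, dedupe, detect_language):
    # Clean → Chunk → Deduplicate, then tag each chunk with its language.
    chunks = dedupe(chunk(clean(raw_text)))
    return [(c, detect_language(c)) for c in chunks]
```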

TextCleaner (src/cleaner.py)

Cleans raw text by removing noise:

  • HTML tags, JavaScript, CSS blocks
  • Encoding issues (mojibake) via ftfy
  • Unwanted symbols and excessive whitespace
  • URLs (optional)
  • Preserves Arabic/English text integrity

TextChunker (src/chunker.py)

Splits documents into manageable chunks:

  • Target: ~500 words per chunk (configurable)
  • Min/Max: 100-600 words (configurable)
  • Overlap: 50 words between chunks for context continuity
  • Respects paragraph and sentence boundaries
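
A simplified word-window version of this scheme (it ignores the paragraph and sentence-boundary handling the real chunker performs):

```python
def chunk_words(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Slide a window of `size` words, stepping back `overlap` words each time.
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```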

SemanticChunker (src/semantic_chunker.py)

Heading-aware chunking for structured documents:

  • Splits on heading boundaries (H1-H6)
  • Preserves document hierarchy in chunk metadata
  • Handles both LTR and RTL (Arabic) text

Deduplicator (src/deduplicator.py)

Detects and removes duplicate content:

  • MD5 hash — exact content match detection
  • Jaccard similarity — near-duplicate detection
  • Configurable similarity threshold

LanguageDetector (src/language_detector.py)

Detects document language using the langdetect library:

  • Supports 50+ languages
  • Special handling for Arabic, Hebrew, Farsi, Urdu (RTL)
  • Adds language metadata to each chunk

ArabicTextFixer (src/arabic_text_fixer.py)

Specialized Arabic text processing:

  • NFKC normalization — converts Arabic Presentation Forms (U+FE70-U+FEFF) to standard Arabic (U+0600-U+06FF)
  • Format-aware processing — different strategies based on document type:
Document Type              Text Reversal   Column Reversal
PDF (Presentation Forms)   No              No
PDF (Standard Arabic)      Yes             Yes
DOCX                       Yes             No
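
The normalization step relies on standard Unicode NFKC folding, which Python's unicodedata module exposes directly:

```python
import unicodedata

def fix_presentation_forms(text: str) -> str:
    # NFKC maps Arabic Presentation Forms to standard Arabic code points.
    return unicodedata.normalize("NFKC", text)

# Example: U+FE8D (ALEF, isolated presentation form) folds to U+0627 (ALEF).
assert fix_presentation_forms("\uFE8D") == "\u0627"
```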

Service Components

Embedder (src/embedder.py)

Generates vector embeddings via Ollama's /api/embed endpoint:

  • Async HTTP — non-blocking embedding generation
  • Batch support — embed multiple texts with concurrency control
  • Health check — verifies Ollama connectivity
  • Local fallback — optional sentence-transformers for offline use

Reranker (src/reranker.py)

Cross-encoder reranking via HuggingFace TEI:

  • Scores query-document pairs for relevance
  • Uses BAAI/bge-reranker-v2-m3 (multilingual)
  • Async HTTP calls to TEI /rerank endpoint
  • Can be disabled via RERANKER_ENABLED=false

LLM Client (src/llm.py)

Ollama LLM client for answer generation:

  • Async HTTP to Ollama /api/chat
  • Configurable temperature and timeout
  • Provides the query + retrieved context as a prompt
  • Health check via /api/tags

BM25 Index (src/bm25.py)

In-memory sparse keyword search:

  • TF-IDF based scoring
  • Builds index from all documents in the collection
  • Returns ranked results by keyword relevance
  • Complements vector search for exact term matching
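
A minimal in-memory BM25 scorer over tokenized documents (a sketch of the idea; the real index in src/bm25.py is built from the full collection, and k1/b defaults here are assumptions):

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]], k1=1.5, b=0.75):
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency is saturated by k1 and length-normalized by b.
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```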

Storage Components

VectorStore (src/vector_store.py)

ChromaDB wrapper supporting both local and remote backends:

Mode     Configuration
Local    Persistent storage at VECTOR_PERSIST_PATH
Remote   HTTP client to ChromaDB server

Operations: add, query, get, delete, count, list collections.

Database (src/database/)

PostgreSQL integration via SQLAlchemy:

File            Purpose
models.py       SQLAlchemy ORM models (Source, Document, DuplicateLog)
session.py      Connection management and session factory
repository.py   CRUD operations with transaction support

Database Schema

Ingestion Components

WebScraper (src/scraper.py)

Scrapes web pages and processes them through the RAG pipeline:

  1. Fetch URL content via requests
  2. Parse HTML with BeautifulSoup4
  3. Check for duplicate URL (hash comparison)
  4. Run through processing pipeline
  5. Embed and store in ChromaDB + PostgreSQL
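
A stdlib stand-in for the parse step (the real scraper uses requests and BeautifulSoup4; this sketch parses a literal string just to show the extract-text-and-skip-scripts shape):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts: list[str] = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip -= 1

    def handle_data(self, data):
        if self._skip == 0 and data.strip():
            self.parts.append(data.strip())

p = TextExtractor()
p.feed("<html><body><h1>Title</h1><script>x=1</script><p>Body text</p></body></html>")
text = " ".join(p.parts)
```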

FileUploader (src/uploader.py)

Processes uploaded files through the RAG pipeline:

  1. Save file temporarily
  2. Parse with DocumentParser (Docling)
  3. Check for duplicate content (hash comparison)
  4. Run through processing pipeline
  5. Embed and store in ChromaDB + PostgreSQL

DocumentParser (src/document_parser.py)

Docling-based document parser supporting multiple formats:

Feature   Description
Formats   PDF, DOCX, PPTX, HTML, Markdown, CSV, XLSX, images
OCR       Tesseract + RapidOCR for scanned documents
Tables    Structure-preserving table extraction
Images    VLM-based image/chart description
Output    Plain text, Markdown, tables, images as separate objects
