System Components
Detailed breakdown of every component in the RAG system
This page documents every component in the RAG system, organized by layer.
API Components
FastAPI App (api/main.py)
The application entry point. Sets up:
- Lifespan handler — initializes and shuts down core services
- CORS middleware — configurable cross-origin access
- Exception handlers — consistent JSON error responses
- Router registration — mounts all endpoint groups under /api/v1
Auth Middleware (api/auth.py)
Validates X-API-Key header on all protected routes.
Exempt paths: /, /docs, /redoc, /openapi.json, /api/v1/health/*
When API_KEYS is empty in .env, authentication is disabled for development.
Dependencies (api/dependencies.py)
Provides singleton instances via FastAPI's dependency injection:
| Dependency | Returns |
|---|---|
get_rag_system() | RAGSystem — main orchestrator |
get_scraper() | WebScraper — URL scraping |
get_uploader() | FileUploader — file processing |
get_task_manager() | TaskManager — background tasks |
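One common way to provide such singletons (a sketch, not necessarily how api/dependencies.py implements it) is to cache the factory functions; the stub classes below stand in for the real services:

```python
from functools import lru_cache

# Stand-ins for the real service classes (illustrative only).
class TaskManager: ...

class RAGSystem:
    def __init__(self, tasks: TaskManager):
        self.tasks = tasks

@lru_cache(maxsize=1)
def get_task_manager() -> TaskManager:
    """FastAPI calls this via Depends(); lru_cache makes it a singleton."""
    return TaskManager()

@lru_cache(maxsize=1)
def get_rag_system() -> RAGSystem:
    # Singletons can depend on each other by calling the cached factories.
    return RAGSystem(tasks=get_task_manager())
```

In a route this would be used as `rag: RAGSystem = Depends(get_rag_system)`, so every request shares the same instances.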
Task Manager (api/services/task_manager.py)
Manages background tasks using ThreadPoolExecutor:
- Creates tasks with unique UUIDs
- Tracks status: pending → running → completed/failed/cancelled
- Reports progress for long-running operations
- Future: database persistence for task state
Core Components
RAGSystem (src/rag.py)
The main orchestrator that coordinates the full RAG pipeline. Provides both sync and async APIs.
Key methods:
| Method | Description |
|---|---|
query(question) / query_async(question) | Full RAG: retrieve → rerank → generate answer |
search(query) / search_async(query) | Semantic search only (no LLM) |
add_documents(docs) | Add documents to the vector store |
upsert_documents(docs) | Smart update — only re-embeds changed content |
delete_document(id) | Delete a single document |
delete_by_source(source) | Delete all documents from a source |
health_check() | Verify all service connections |
DocumentManager (src/document_manager.py)
Wraps ChromaDB and PostgreSQL operations into a unified CRUD interface:
- Add — embed and store new documents
- Upsert — detect changes via content hash, only re-embed if changed
- Delete — remove from both ChromaDB and PostgreSQL
- Query — list, filter, and paginate documents
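The upsert decision can be sketched roughly as below; the dict-backed store and the choice of SHA-256 are illustrative assumptions (the project may use a different digest, and the real code writes to ChromaDB and PostgreSQL rather than a dict):

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def upsert(store: dict, doc_id: str, text: str) -> str:
    """Return 'skipped' when the stored hash matches, else (re)store.
    `store` maps doc_id -> {"hash": ..., "text": ...}."""
    h = content_hash(text)
    existing = store.get(doc_id)
    if existing and existing["hash"] == h:
        return "skipped"                            # unchanged: no re-embedding
    store[doc_id] = {"hash": h, "text": text}       # real code would embed here
    return "embedded"
```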
SemanticRetriever (src/retriever.py)
Performs hybrid search combining multiple retrieval strategies:
- Embed the query via Ollama
- Vector search — cosine similarity in ChromaDB (top 20)
- BM25 search — TF-IDF keyword matching (top 20)
- RRF fusion — merge and rank results from both searches
- Rerank — cross-encoder scoring via TEI (return top 5)
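The RRF fusion step has a standard formulation, sketched here with the conventional k = 60 constant (the project's actual constant is an assumption):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both the vector and BM25 lists accumulate the largest scores, which is why fusion tends to beat either search alone.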
Processing Pipeline Components
RAGPreprocessor (src/pipeline.py)
Orchestrates the full document processing pipeline:
Raw Text → Clean → Chunk → Deduplicate → Detect Language → Output
TextCleaner (src/cleaner.py)
Cleans raw text by removing noise:
- HTML tags, JavaScript, CSS blocks
- Encoding issues (mojibake) via ftfy
- Unwanted symbols and excessive whitespace
- URLs (optional)
- Preserves Arabic/English text integrity
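A stdlib-only sketch of such a cleaning pass (the real cleaner also runs ftfy for mojibake repair, which is omitted here):

```python
import html
import re

def clean_text(raw: str, strip_urls: bool = False) -> str:
    """Minimal cleaning pass: drop script/style blocks, remaining tags,
    HTML entities, optional URLs, and collapse whitespace."""
    text = re.sub(r"(?is)<(script|style)[^>]*>.*?</\1>", " ", raw)
    text = re.sub(r"(?s)<[^>]+>", " ", text)       # remaining HTML tags
    text = html.unescape(text)                     # &amp; -> &, etc.
    if strip_urls:
        text = re.sub(r"https?://\S+", " ", text)
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace
```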
TextChunker (src/chunker.py)
Splits documents into manageable chunks:
- Target: ~500 words per chunk (configurable)
- Min/Max: 100-600 words (configurable)
- Overlap: 50 words between chunks for context continuity
- Respects paragraph and sentence boundaries
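The sliding-window arithmetic can be sketched as below; the boundary-aware splitting at paragraphs and sentences is omitted for brevity:

```python
def chunk_words(text: str, target: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size word chunking where consecutive chunks share
    `overlap` words for context continuity."""
    words = text.split()
    if len(words) <= target:
        return [" ".join(words)] if words else []
    chunks, start = [], 0
    step = target - overlap          # advance leaves `overlap` words shared
    while start < len(words):
        chunks.append(" ".join(words[start:start + target]))
        if start + target >= len(words):
            break
        start += step
    return chunks
```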
SemanticChunker (src/semantic_chunker.py)
Heading-aware chunking for structured documents:
- Splits on heading boundaries (H1-H6)
- Preserves document hierarchy in chunk metadata
- Handles both LTR and RTL (Arabic) text
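Assuming Markdown-style headings, the hierarchy-preserving split might look like this (the real chunker's handling of HTML headings and RTL text is not shown):

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def split_by_headings(markdown: str) -> list[dict]:
    """Split a Markdown document at heading boundaries, keeping the
    heading path (hierarchy) as chunk metadata."""
    sections, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"path": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        m = HEADING.match(line)
        if m:
            flush()
            level, title = len(m.group(1)), m.group(2).strip()
            del path[level - 1:]          # pop headings at this level or deeper
            path.append(title)
        else:
            body.append(line)
    flush()
    return sections
```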
Deduplicator (src/deduplicator.py)
Detects and removes duplicate content:
- MD5 hash — exact content match detection
- Jaccard similarity — near-duplicate detection
- Configurable similarity threshold
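Both detection strategies are easy to sketch; the word-level Jaccard tokenization and the 0.85 default threshold are illustrative assumptions:

```python
import hashlib

def exact_hash(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def is_duplicate(new: str, seen: list[str], threshold: float = 0.85) -> bool:
    new_hash = exact_hash(new)
    for old in seen:
        if exact_hash(old) == new_hash:        # exact content match
            return True
        if jaccard(new, old) >= threshold:     # near-duplicate
            return True
    return False
```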
LanguageDetector (src/language_detector.py)
Detects document language using the langdetect library:
- Supports 50+ languages
- Special handling for Arabic, Hebrew, Farsi, Urdu (RTL)
- Adds language metadata to each chunk
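langdetect itself is a third-party call, but the RTL special-casing can be illustrated with plain Unicode-range checks (the function names here are hypothetical):

```python
def is_rtl(text: str) -> bool:
    """Heuristic RTL check: does the text contain characters from the
    Arabic or Hebrew Unicode blocks? (Farsi and Urdu use the Arabic
    blocks; the langdetect call itself is not shown.)"""
    rtl_ranges = [
        (0x0590, 0x05FF),  # Hebrew
        (0x0600, 0x06FF),  # Arabic
        (0x0750, 0x077F),  # Arabic Supplement
        (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
        (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
    ]
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in rtl_ranges)

def attach_language(chunk: dict, lang: str) -> dict:
    """Tag a chunk with language metadata, as the pipeline does."""
    chunk["language"] = lang
    chunk["rtl"] = is_rtl(chunk.get("text", ""))
    return chunk
```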
ArabicTextFixer (src/arabic_text_fixer.py)
Specialized Arabic text processing:
- NFKC normalization — converts Arabic Presentation Forms (U+FE70-U+FEFF) to standard Arabic (U+0600-U+06FF)
- Format-aware processing — different strategies based on document type:
| Document Type | Text Reversal | Column Reversal |
|---|---|---|
| PDF (Presentation Forms) | No | No |
| PDF (Standard Arabic) | Yes | Yes |
| DOCX | Yes | No |
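The normalization step relies on standard Unicode behavior, so it can be shown directly (the format-specific reversal logic from the table above is not sketched here):

```python
import unicodedata

def fix_presentation_forms(text: str) -> str:
    """NFKC normalization maps Arabic Presentation Forms (U+FB50–U+FDFF,
    U+FE70–U+FEFF) back to standard Arabic letters (U+0600–U+06FF),
    e.g. positional glyph variants collapse to their base letter."""
    return unicodedata.normalize("NFKC", text)
```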
Service Components
Embedder (src/embedder.py)
Generates vector embeddings via Ollama's /api/embed endpoint:
- Async HTTP — non-blocking embedding generation
- Batch support — embed multiple texts with concurrency control
- Health check — verifies Ollama connectivity
- Local fallback — optional sentence-transformers for offline use
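The concurrency-control part of batch embedding can be sketched with a semaphore; `embed_one` stands in for the real HTTP call to Ollama's /api/embed:

```python
import asyncio

async def embed_batch(texts: list[str], embed_one, max_concurrency: int = 8) -> list:
    """Embed many texts concurrently, capping in-flight requests with a
    semaphore so the embedding service is not overwhelmed."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(text: str):
        async with sem:                 # at most `max_concurrency` at once
            return await embed_one(text)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(bounded(t) for t in texts))
```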
Reranker (src/reranker.py)
Cross-encoder reranking via HuggingFace TEI:
- Scores query-document pairs for relevance
- Uses BAAI/bge-reranker-v2-m3 (multilingual)
- Async HTTP calls to the TEI /rerank endpoint
- Can be disabled via RERANKER_ENABLED=false
LLM Client (src/llm.py)
Ollama LLM client for answer generation:
- Async HTTP to Ollama's /api/chat
- Configurable temperature and timeout
- Sends the query plus retrieved context as the prompt
- Health check via /api/tags
BM25 Index (src/bm25.py)
In-memory sparse keyword search:
- TF-IDF based scoring
- Builds index from all documents in the collection
- Returns ranked results by keyword relevance
- Complements vector search for exact term matching
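A compact Okapi BM25 scorer over whitespace tokens (the real index builds document statistics once rather than per query, and the tokenizer and k1/b parameters here are assumptions):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n      # average document length
    df = Counter()                                  # document frequency per term
    for doc in tokenized:
        df.update(set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
            score += idf * norm
        scores.append(score)
    return scores
```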
Storage Components
VectorStore (src/vector_store.py)
ChromaDB wrapper supporting both local and remote backends:
| Mode | Configuration |
|---|---|
| Local | Persistent storage at VECTOR_PERSIST_PATH |
| Remote | HTTP client to ChromaDB server |
Operations: add, query, get, delete, count, list collections.
Database (src/database/)
PostgreSQL integration via SQLAlchemy:
| File | Purpose |
|---|---|
models.py | SQLAlchemy ORM models (Source, Document, DuplicateLog) |
session.py | Connection management and session factory |
repository.py | CRUD operations with transaction support |
Database Schema
Ingestion Components
WebScraper (src/scraper.py)
Scrapes web pages and processes them through the RAG pipeline:
- Fetch URL content via requests
- Parse HTML with BeautifulSoup4
- Check for duplicate URLs (hash comparison)
- Run through the processing pipeline
- Embed and store in ChromaDB + PostgreSQL
FileUploader (src/uploader.py)
Processes uploaded files through the RAG pipeline:
- Save the file temporarily
- Parse with DocumentParser (Docling)
- Check for duplicate content (hash comparison)
- Run through processing pipeline
- Embed and store in ChromaDB + PostgreSQL
DocumentParser (src/document_parser.py)
Docling-based document parser supporting multiple formats:
| Feature | Description |
|---|---|
| Formats | PDF, DOCX, PPTX, HTML, Markdown, CSV, XLSX, images |
| OCR | Tesseract + RapidOCR for scanned documents |
| Tables | Structure-preserving table extraction |
| Images | VLM-based image/chart description |
| Output | Plain text, Markdown, tables, images as separate objects |