
System Components

Detailed breakdown of every component in the RAG system

This page documents every component in the RAG system, organized by layer.

API Components

FastAPI App (api/main.py)

The application entry point. Sets up:

  • Lifespan handler — initializes and shuts down core services
  • CORS middleware — configurable cross-origin access
  • Exception handlers — consistent JSON error responses
  • Router registration — mounts all endpoint groups under /api/v1

Auth Middleware (api/auth.py)

Validates X-API-Key header on all protected routes.

Exempt paths: /, /docs, /redoc, /openapi.json, /api/v1/health/*

When API_KEYS is empty in .env, authentication is disabled for development.
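
The exempt-path and key checks can be sketched as plain functions (hypothetical helpers, not the actual api/auth.py implementation):

```python
from fnmatch import fnmatch

EXEMPT_PATHS = ["/", "/docs", "/redoc", "/openapi.json", "/api/v1/health/*"]

def is_exempt(path: str) -> bool:
    # Exempt paths bypass the API-key check; the last pattern is a glob.
    return any(fnmatch(path, pattern) for pattern in EXEMPT_PATHS)

def is_authorized(headers: dict, api_keys: set[str]) -> bool:
    # Empty API_KEYS disables authentication (development mode).
    if not api_keys:
        return True
    return headers.get("X-API-Key") in api_keys
```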

Dependencies (api/dependencies.py)

Provides singleton instances via FastAPI's dependency injection:

Dependency           Returns
get_rag_system()     RAGSystem — main orchestrator
get_scraper()        WebScraper — URL scraping
get_uploader()       FileUploader — file processing
get_task_manager()   TaskManager — background tasks
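
One common way to provide singletons is to cache the factory; this sketch assumes that pattern (the real factories may construct their instances differently):

```python
from functools import lru_cache

class RAGSystem:
    """Stand-in for the real class in src/rag.py."""

@lru_cache(maxsize=1)
def get_rag_system() -> RAGSystem:
    # First call constructs the instance; later calls return the same object.
    return RAGSystem()
```

FastAPI then injects the shared instance into route handlers with Depends(get_rag_system).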

Task Manager (api/services/task_manager.py)

Manages background tasks using ThreadPoolExecutor:

  • Creates tasks with unique UUIDs
  • Tracks status: pending → running → completed / failed / cancelled
  • Reports progress for long-running operations
  • Future: database persistence for task state
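
A minimal version of this pattern, assuming the status names above (it omits the progress reporting and cancellation the real TaskManager provides):

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

class TaskManager:
    def __init__(self, workers: int = 4):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._status: dict[str, str] = {}

    def submit(self, fn, *args) -> str:
        # Each task gets a unique UUID so clients can poll its status.
        task_id = str(uuid.uuid4())
        self._status[task_id] = "pending"

        def run():
            self._status[task_id] = "running"
            try:
                fn(*args)
                self._status[task_id] = "completed"
            except Exception:
                self._status[task_id] = "failed"

        self._pool.submit(run)
        return task_id

    def status(self, task_id: str) -> str:
        return self._status[task_id]
```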

Core Components

RAGSystem (src/rag.py)

The main orchestrator that coordinates the full RAG pipeline. Provides both sync and async APIs.

Key methods:

Method                                    Description
query(question) / query_async(question)   Full RAG: retrieve → rerank → generate answer
search(query) / search_async(query)       Semantic search only (no LLM)
add_documents(docs)                       Add documents to the vector store
upsert_documents(docs)                    Smart update — only re-embeds changed content
delete_document(id)                       Delete a single document
delete_by_source(source)                  Delete all documents from a source
health_check()                            Verify all service connections
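
The retrieve → rerank → generate flow behind query() can be sketched as follows (the component stubs are hypothetical; the real method wires in SemanticRetriever, Reranker, and the LLM client):

```python
def query(question, retriever, reranker, llm, top_k=5):
    candidates = retriever(question)          # retrieve candidate chunks
    ranked = reranker(question, candidates)   # rerank by relevance
    context = "\n\n".join(ranked[:top_k])     # keep the best chunks
    return llm(question, context)             # generate the answer
```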

DocumentManager (src/document_manager.py)

Wraps ChromaDB and PostgreSQL operations into a unified CRUD interface:

  • Add — embed and store new documents
  • Upsert — detect changes via content hash, only re-embed if changed
  • Delete — remove from both ChromaDB and PostgreSQL
  • Query — list, filter, and paginate documents

SemanticRetriever (src/retriever.py)

Performs hybrid search combining multiple retrieval strategies:

  1. Embed the query via Ollama
  2. Vector search — cosine similarity in ChromaDB (top 20)
  3. BM25 search — TF-IDF keyword matching (top 20)
  4. RRF fusion — merge and rank results from both searches
  5. Rerank — cross-encoder scoring via TEI (return top 5)
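
Step 4 uses Reciprocal Rank Fusion; a standard formulation over the two ranked ID lists looks like this (k = 60 is the conventional RRF constant, assumed here):

```python
def rrf_fuse(vector_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    # Each list contributes 1 / (k + rank) per document; sums decide the order.
    scores: dict[str, float] = {}
    for ranking in (vector_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both result lists accumulate score from each, which is why they rise to the top even without score normalization across the two searches.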

Processing Pipeline Components

RAGPreprocessor (src/pipeline.py)

Orchestrates the full document processing pipeline:

Raw Text → Clean → Chunk → Deduplicate → Detect Language → Output
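
As function composition, the stages above look roughly like this (the stage functions are placeholders; the real pipeline passes richer chunk objects between stages):

```python
def preprocess(raw_text, clean, chunk, dedupe, detect_language):
    # Clean → Chunk → Deduplicate, then tag each chunk with its language.
    chunks = dedupe(chunk(clean(raw_text)))
    return [(c, detect_language(c)) for c in chunks]
```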

TextCleaner (src/cleaner.py)

Cleans raw text by removing noise:

  • HTML tags, JavaScript, CSS blocks
  • Encoding issues (mojibake) via ftfy
  • Unwanted symbols and excessive whitespace
  • URLs (optional)
  • Preserves Arabic/English text integrity

TextChunker (src/chunker.py)

Splits documents into manageable chunks:

  • Target: ~500 words per chunk (configurable)
  • Min/Max: 100-600 words (configurable)
  • Overlap: 50 words between chunks for context continuity
  • Respects paragraph and sentence boundaries
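
A simplified word-window version of this scheme (it ignores the paragraph and sentence-boundary handling the real chunker performs):

```python
def chunk_words(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Slide a window of `size` words, stepping back `overlap` words each time.
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```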

SemanticChunker (src/semantic_chunker.py)

Heading-aware chunking for structured documents:

  • Splits on heading boundaries (H1-H6)
  • Preserves document hierarchy in chunk metadata
  • Handles both LTR and RTL (Arabic) text

Deduplicator (src/deduplicator.py)

Detects and removes duplicate content:

  • MD5 hash — exact content match detection
  • Jaccard similarity — near-duplicate detection
  • Configurable similarity threshold

LanguageDetector (src/language_detector.py)

Detects document language using the langdetect library:

  • Supports 50+ languages
  • Special handling for Arabic, Hebrew, Farsi, Urdu (RTL)
  • Adds language metadata to each chunk

ArabicTextFixer (src/arabic_text_fixer.py)

Specialized Arabic text processing:

  • NFKC normalization — converts Arabic Presentation Forms (U+FE70-U+FEFF) to standard Arabic (U+0600-U+06FF)
  • Format-aware processing — different strategies based on document type:
Document Type              Text Reversal   Column Reversal
PDF (Presentation Forms)   No              No
PDF (Standard Arabic)      Yes             Yes
DOCX                       Yes             No
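
The normalization step relies on standard Unicode NFKC folding, which Python's unicodedata module exposes directly:

```python
import unicodedata

def fix_presentation_forms(text: str) -> str:
    # NFKC maps Arabic Presentation Forms to standard Arabic code points.
    return unicodedata.normalize("NFKC", text)

# Example: U+FE8D (ALEF, isolated presentation form) folds to U+0627 (ALEF).
assert fix_presentation_forms("\uFE8D") == "\u0627"
```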

Service Components

Embedder (src/embedder.py)

Generates vector embeddings via Ollama's /api/embed endpoint:

  • Async HTTP — non-blocking embedding generation
  • Batch support — embed multiple texts with concurrency control
  • Health check — verifies Ollama connectivity
  • Local fallback — optional sentence-transformers for offline use

Reranker (src/reranker.py)

Cross-encoder reranking via HuggingFace TEI:

  • Scores query-document pairs for relevance
  • Uses BAAI/bge-reranker-v2-m3 (multilingual)
  • Async HTTP calls to TEI /rerank endpoint
  • Can be disabled via RERANKER_ENABLED=false

LLM Client (src/llm.py)

Ollama LLM client for answer generation:

  • Async HTTP to Ollama /api/chat
  • Configurable temperature and timeout
  • Provides the query + retrieved context as a prompt
  • Health check via /api/tags

BM25 Index (src/bm25.py)

In-memory sparse keyword search:

  • TF-IDF based scoring
  • Builds index from all documents in the collection
  • Returns ranked results by keyword relevance
  • Complements vector search for exact term matching
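
A minimal in-memory BM25 scorer over tokenized documents (a sketch of the idea; the real index in src/bm25.py is built from the full collection, and k1/b defaults here are assumptions):

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]], k1=1.5, b=0.75):
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency is saturated by k1 and length-normalized by b.
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores
```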

Storage Components

VectorStore (src/vector_store.py)

ChromaDB wrapper supporting both local and remote backends:

Mode     Configuration
Local    Persistent storage at VECTOR_PERSIST_PATH
Remote   HTTP client to ChromaDB server

Operations: add, query, get, delete, count, list collections.

Database (src/database/)

PostgreSQL integration via SQLAlchemy:

File            Purpose
models.py       SQLAlchemy ORM models (Source, Document, DuplicateLog)
session.py      Connection management and session factory
repository.py   CRUD operations with transaction support

Database Schema

Ingestion Components

WebScraper (src/scraper.py)

Scrapes web pages and processes them through the RAG pipeline:

  1. Fetch URL content via requests
  2. Parse HTML with BeautifulSoup4
  3. Check for duplicate URL (hash comparison)
  4. Run through processing pipeline
  5. Embed and store in ChromaDB + PostgreSQL
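
A stdlib stand-in for the parse step (the real scraper uses requests and BeautifulSoup4; this sketch parses a literal string just to show the extract-text-and-skip-scripts shape):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts: list[str] = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip -= 1

    def handle_data(self, data):
        if self._skip == 0 and data.strip():
            self.parts.append(data.strip())

p = TextExtractor()
p.feed("<html><body><h1>Title</h1><script>x=1</script><p>Body text</p></body></html>")
text = " ".join(p.parts)
```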

FileUploader (src/uploader.py)

Processes uploaded files through the RAG pipeline:

  1. Save file temporarily
  2. Parse with DocumentParser (Docling)
  3. Check for duplicate content (hash comparison)
  4. Run through processing pipeline
  5. Embed and store in ChromaDB + PostgreSQL

DocumentParser (src/document_parser.py)

Docling-based document parser supporting multiple formats:

Feature   Description
Formats   PDF, DOCX, PPTX, HTML, Markdown, CSV, XLSX, images
OCR       Tesseract + RapidOCR for scanned documents
Tables    Structure-preserving table extraction
Images    VLM-based image/chart description
Output    Plain text, Markdown, tables, images as separate objects
