Core Modules Overview

The RAG system is organized into four module groups under the `src/` directory: Preprocessing, Retrieval, Storage, and Ingestion.

Module Groups

Preprocessing

The document-processing pipeline transforms raw text into clean, chunked, deduplicated content that is ready for embedding.

| Module | Purpose |
| --- | --- |
| `pipeline.py` | `RAGPreprocessor`: orchestrates the full pipeline |
| `cleaner.py` | HTML/JS/CSS removal, encoding fixes |
| `chunker.py` | Text splitting into chunks of roughly 500 words |
| `semantic_chunker.py` | Heading-aware chunking for structured documents |
| `deduplicator.py` | Duplicate detection via MD5 hashing and Jaccard similarity |
| `language_detector.py` | Detection of 50+ languages via langdetect |
| `arabic_text_fixer.py` | NFKC normalization, RTL text handling |
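
The chunking and deduplication steps listed above can be illustrated with a minimal sketch. This is not the project's implementation: the function names (`chunk_words`, `md5_fingerprint`, `jaccard`, `deduplicate`), the word-set tokenization, and the 0.9 similarity threshold are all assumptions chosen for illustration. It only shows the general technique of fixed-size word chunking followed by exact-match (MD5) and near-match (Jaccard) filtering.

```python
import hashlib
import re


def chunk_words(text: str, max_words: int = 500) -> list[str]:
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def md5_fingerprint(text: str) -> str:
    """Hash case- and whitespace-normalized text to catch exact duplicates."""
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' word sets: |A & B| / |A | B|."""
    set_a = set(re.findall(r"\w+", a.lower()))
    set_b = set(re.findall(r"\w+", b.lower()))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def deduplicate(chunks: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each chunk; drop exact and near duplicates.

    The threshold value is a hypothetical default, not taken from the project.
    """
    seen_hashes: set[str] = set()
    kept: list[str] = []
    for chunk in chunks:
        digest = md5_fingerprint(chunk)
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        if any(jaccard(chunk, other) >= threshold for other in kept):
            continue  # near duplicate of a chunk already kept
        seen_hashes.add(digest)
        kept.append(chunk)
    return kept
```

The hash check is cheap and runs first; the pairwise Jaccard comparison is quadratic in the number of kept chunks, so a production deduplicator would likely need an index such as MinHash for large corpora.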

Retrieval

The search and generation pipeline that finds relevant documents and generates answers.

| Module | Purpose |
| --- | --- |
| `rag.py` | `RAGSystem`: main orchestrator (async and sync) |
| `retriever.py` | `SemanticRetriever`: hybrid search with RRF fusion |
| `embedder.py` | Vector embedding via Ollama |
| `reranker.py` | Cross-encoder reranking via HuggingFace TEI |
| `llm.py` | LLM answer generation via Ollama |
| `bm25.py` | BM25 sparse keyword search |
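
Hybrid search with RRF (reciprocal rank fusion) merges the ranked lists produced by the dense (embedding) and sparse (BM25) retrievers. The sketch below shows the general technique only; the function name `rrf_fuse`, the document IDs, and the constant `k = 60` (a commonly used default) are assumptions, not details taken from `retriever.py`.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Higher fused scores rank first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from a dense retriever and a BM25 retriever.
dense_hits = ["d1", "d2", "d3"]
sparse_hits = ["d3", "d1", "d4"]
fused = rrf_fuse([dense_hits, sparse_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Because RRF uses only rank positions, it needs no score normalization between the two retrievers, which is one reason it is a common choice for combining BM25 with embedding similarity.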

Storage

Vector store and database components for persistent data.

| Module | Purpose |
| --- | --- |
| `vector_store.py` | ChromaDB wrapper (local/remote) |
| `document_manager.py` | Unified CRUD across both stores |
| `database/models.py` | SQLAlchemy ORM models |
| `database/repository.py` | CRUD operations with transactions |
| `database/session.py` | PostgreSQL connection management |

Ingestion

Content acquisition from external sources.

| Module | Purpose |
| --- | --- |
| `scraper.py` | Web page scraping via BeautifulSoup |
| `uploader.py` | File upload processing |
| `document_parser.py` | Docling-based multi-format document parser |

File Structure

src/
├── __init__.py               # Package exports
│
│ # Preprocessing
├── pipeline.py               # RAGPreprocessor class
├── cleaner.py                # TextCleaner (HTML, JS, symbols)
├── chunker.py                # TextChunker (~500 words)
├── semantic_chunker.py       # Heading-based chunking
├── deduplicator.py           # Duplicate detection
├── language_detector.py      # Language detection
├── arabic_text_fixer.py      # Arabic NFKC normalization
├── formatter.py              # Output formatting
│
│ # Retrieval
├── rag.py                    # RAGSystem orchestrator
├── retriever.py              # SemanticRetriever (hybrid search)
├── embedder.py               # Ollama embeddings (async)
├── reranker.py               # Cross-encoder reranking (async)
├── llm.py                    # Ollama LLM client (async)
├── bm25.py                   # BM25 sparse search
│
│ # Storage
├── vector_store.py           # ChromaDB wrapper
├── document_manager.py       # Unified document CRUD
│
│ # Ingestion
├── scraper.py                # WebScraper
├── uploader.py               # FileUploader
├── document_parser.py        # Docling document parser
├── incremental_parser.py     # Streaming parser
│
│ # Database
└── database/
    ├── __init__.py
    ├── models.py             # SQLAlchemy models
    ├── repository.py         # CRUD operations
    └── session.py            # Connection management
