Core Modules Overview
Overview of the RAG system's core Python modules
The RAG system is organized into four module groups under the src/ directory.
Module Groups
Preprocessing
The document processing pipeline that transforms raw text into clean, chunked, and deduplicated content ready for embedding.
| Module | Purpose |
|---|---|
| pipeline.py | RAGPreprocessor: orchestrates the full pipeline |
| cleaner.py | HTML/JS/CSS removal, encoding fixes |
| chunker.py | Text splitting into ~500-word chunks |
| semantic_chunker.py | Heading-aware chunking for structured documents |
| deduplicator.py | Duplicate detection via MD5 hash + Jaccard similarity |
| language_detector.py | Detection of 50+ languages via langdetect |
| arabic_text_fixer.py | NFKC normalization, RTL text handling |
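To make the chunking and deduplication steps in the table concrete, here is a minimal sketch using only the standard library. It is not the code in chunker.py or deduplicator.py. The function names (`chunk_words`, `jaccard`, `deduplicate`), the word-set definition of Jaccard similarity, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import hashlib
import unicodedata


def chunk_words(text: str, size: int = 500) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def _normalize(text: str) -> str:
    # NFKC folds compatibility characters; lowercasing and collapsing
    # whitespace make comparisons insensitive to trivial differences.
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the sets of normalized words in a and b."""
    set_a, set_b = set(_normalize(a).split()), set(_normalize(b).split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def deduplicate(chunks: list[str], threshold: float = 0.9) -> list[str]:
    """Drop exact duplicates (MD5 match) and near duplicates (Jaccard >= threshold)."""
    seen_hashes: set[str] = set()
    kept: list[str] = []
    for chunk in chunks:
        digest = hashlib.md5(_normalize(chunk).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        if any(jaccard(chunk, other) >= threshold for other in kept):
            continue  # near duplicate of a chunk already kept
        seen_hashes.add(digest)
        kept.append(chunk)
    return kept
```

The hash check is cheap and catches exact repeats first. The pairwise Jaccard pass is quadratic in the number of kept chunks, so a real pipeline would likely need something more scalable for large corpora.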
Retrieval
The search and generation pipeline that finds relevant documents and generates answers.
| Module | Purpose |
|---|---|
| rag.py | RAGSystem: main orchestrator (async + sync) |
| retriever.py | SemanticRetriever: hybrid search with RRF fusion |
| embedder.py | Vector embedding via Ollama |
| reranker.py | Cross-encoder reranking via HuggingFace TEI |
| llm.py | LLM answer generation via Ollama |
| bm25.py | BM25 sparse keyword search |
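The retriever row mentions hybrid search with RRF (Reciprocal Rank Fusion). As a rough illustration of the idea, the sketch below merges a dense ranking and a sparse (BM25) ranking by summing 1 / (k + rank) for each document. The function name `rrf_fuse`, the document IDs, and the constant k = 60 (a commonly used default) are assumptions, not values taken from retriever.py.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores the sum of 1 / (k + rank) over every list it
    appears in, where rank starts at 1 for the top result.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical results from a dense (embedding) search and a BM25 search.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([dense, sparse])
```

Because RRF uses only ranks, it needs no calibration between cosine similarities and BM25 scores, which sit on different scales. Documents that appear in both lists (here `doc_a` and `doc_b`) rise above those found by only one retriever.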
Storage
Vector store and database components for persistent data.
| Module | Purpose |
|---|---|
| vector_store.py | ChromaDB wrapper (local/remote) |
| document_manager.py | Unified CRUD across both stores |
| database/models.py | SQLAlchemy ORM models |
| database/repository.py | CRUD operations with transactions |
| database/session.py | PostgreSQL connection management |
Ingestion
Content acquisition from external sources.
| Module | Purpose |
|---|---|
| scraper.py | Web page scraping via BeautifulSoup |
| uploader.py | File upload processing |
| document_parser.py | Docling-based multi-format document parser |
File Structure
src/
├── __init__.py # Package exports
│
│ # Preprocessing
├── pipeline.py # RAGPreprocessor class
├── cleaner.py # TextCleaner (HTML, JS, symbols)
├── chunker.py # TextChunker (~500 words)
├── semantic_chunker.py # Heading-based chunking
├── deduplicator.py # Duplicate detection
├── language_detector.py # Language detection
├── arabic_text_fixer.py # Arabic NFKC normalization
├── formatter.py # Output formatting
│
│ # Retrieval
├── rag.py # RAGSystem orchestrator
├── retriever.py # SemanticRetriever (hybrid search)
├── embedder.py # Ollama embeddings (async)
├── reranker.py # Cross-encoder reranking (async)
├── llm.py # Ollama LLM client (async)
├── bm25.py # BM25 sparse search
│
│ # Storage
├── vector_store.py # ChromaDB wrapper
├── document_manager.py # Unified document CRUD
│
│ # Ingestion
├── scraper.py # WebScraper
├── uploader.py # FileUploader
├── document_parser.py # Docling document parser
├── incremental_parser.py # Streaming parser
│
│ # Database
└── database/
    ├── __init__.py
    ├── models.py # SQLAlchemy models
    ├── repository.py # CRUD operations
    └── session.py # Connection management