Core Modules Overview

Overview of the RAG system's core Python modules

The RAG system is organized into four module groups under the src/ directory.

Module Groups

Preprocessing

The document processing pipeline that transforms raw text into clean, chunked, and deduplicated content ready for embedding.

| Module | Purpose |
|---|---|
| pipeline.py | RAGPreprocessor: orchestrates the full pipeline |
| cleaner.py | HTML/JS/CSS removal, encoding fixes |
| chunker.py | Text splitting into ~500-word chunks |
| semantic_chunker.py | Heading-aware chunking for structured documents |
| deduplicator.py | MD5 hash + Jaccard similarity duplicate detection |
| language_detector.py | Detection of 50+ languages via langdetect |
| arabic_text_fixer.py | NFKC normalization, RTL text handling |
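
The chunking and deduplication steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names, the 0.9 Jaccard threshold, and the word-set tokenization are all assumptions made for the example.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Apply NFKC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def chunk_words(text: str, max_words: int = 500) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def deduplicate(chunks: list[str], threshold: float = 0.9) -> list[str]:
    """Drop exact duplicates (MD5) and near duplicates (Jaccard >= threshold).

    The threshold value is an assumption for this sketch.
    """
    seen_hashes: set[str] = set()
    kept: list[str] = []
    for chunk in chunks:
        # MD5 is used only as a fast fingerprint here, not for security.
        digest = hashlib.md5(chunk.encode("utf-8"), usedforsecurity=False).hexdigest()
        if digest in seen_hashes:
            continue
        if any(jaccard(chunk, other) >= threshold for other in kept):
            continue
        seen_hashes.add(digest)
        kept.append(chunk)
    return kept
```

The hash check catches byte-identical chunks cheaply, and the pairwise Jaccard comparison catches chunks that differ by only a few words. The pairwise comparison is quadratic in the number of kept chunks, so a real pipeline would likely need a cheaper candidate filter for large corpora.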

Retrieval

The search and generation pipeline that finds relevant documents and generates answers.

| Module | Purpose |
|---|---|
| rag.py | RAGSystem: main orchestrator (async + sync) |
| retriever.py | SemanticRetriever: hybrid search with RRF fusion |
| embedder.py | Vector embedding via Ollama |
| reranker.py | Cross-encoder reranking via HuggingFace TEI |
| llm.py | LLM answer generation via Ollama |
| bm25.py | BM25 sparse keyword search |
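
Reciprocal rank fusion (RRF), the fusion step named for retriever.py, combines ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below shows the general technique only: the function name `rrf_fuse` is hypothetical, and k = 60 is the conventional constant from the RRF literature rather than a value confirmed for this project.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    where rank starts at 1. The value of k is an assumption for this sketch.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Example: one dense (embedding) ranking and one sparse (BM25) ranking.
dense = ["d1", "d2", "d3"]
sparse = ["d3", "d1", "d4"]
fused = rrf_fuse([dense, sparse])
print([doc_id for doc_id, _ in fused])
```

Because RRF uses only rank positions, it does not require the dense and sparse scores to be on comparable scales, which is a common reason for choosing it in hybrid search.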

Storage

Vector store and database components for persistent data.

| Module | Purpose |
|---|---|
| vector_store.py | ChromaDB wrapper (local/remote) |
| document_manager.py | Unified CRUD across both stores |
| database/models.py | SQLAlchemy ORM models |
| database/repository.py | CRUD operations with transactions |
| database/session.py | PostgreSQL connection management |
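
One way to picture "unified CRUD across both stores" is a thin manager that writes to the relational store and the vector store together and undoes the first write if the second fails. The sketch below uses in-memory stand-ins; every class and method name in it is hypothetical and is not taken from document_manager.py, vector_store.py, or the database package.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InMemoryVectorStore:
    """Hypothetical stand-in for the ChromaDB wrapper."""
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def upsert(self, doc_id: str, embedding: list[float]) -> None:
        self.vectors[doc_id] = embedding

    def delete(self, doc_id: str) -> None:
        self.vectors.pop(doc_id, None)


@dataclass
class InMemoryMetadataStore:
    """Hypothetical stand-in for the PostgreSQL repository."""
    rows: dict[str, dict] = field(default_factory=dict)

    def insert(self, doc_id: str, row: dict) -> None:
        self.rows[doc_id] = row

    def get(self, doc_id: str) -> Optional[dict]:
        return self.rows.get(doc_id)

    def delete(self, doc_id: str) -> None:
        self.rows.pop(doc_id, None)


class DocumentManager:
    """Illustrative manager that keeps both stores in step."""

    def __init__(self, vectors: InMemoryVectorStore, metadata: InMemoryMetadataStore):
        self.vectors = vectors
        self.metadata = metadata

    def add(self, doc_id: str, text: str, embedding: list[float]) -> None:
        self.metadata.insert(doc_id, {"text": text})
        try:
            self.vectors.upsert(doc_id, embedding)
        except Exception:
            # Undo the metadata write so the stores do not drift apart.
            self.metadata.delete(doc_id)
            raise

    def get(self, doc_id: str) -> Optional[dict]:
        return self.metadata.get(doc_id)

    def delete(self, doc_id: str) -> None:
        self.vectors.delete(doc_id)
        self.metadata.delete(doc_id)
```

The design point is that callers deal with a single interface, while the manager is responsible for keeping the two stores consistent.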

Ingestion

Content acquisition from external sources.

| Module | Purpose |
|---|---|
| scraper.py | Web page scraping via BeautifulSoup |
| uploader.py | File upload processing |
| document_parser.py | Docling-based multi-format document parser |

File Structure

src/
├── __init__.py               # Package exports

│ # Preprocessing
├── pipeline.py               # RAGPreprocessor class
├── cleaner.py                # TextCleaner (HTML, JS, symbols)
├── chunker.py                # TextChunker (~500 words)
├── semantic_chunker.py       # Heading-based chunking
├── deduplicator.py           # Duplicate detection
├── language_detector.py      # Language detection
├── arabic_text_fixer.py      # Arabic NFKC normalization
├── formatter.py              # Output formatting

│ # Retrieval
├── rag.py                    # RAGSystem orchestrator
├── retriever.py              # SemanticRetriever (hybrid search)
├── embedder.py               # Ollama embeddings (async)
├── reranker.py               # Cross-encoder reranking (async)
├── llm.py                    # Ollama LLM client (async)
├── bm25.py                   # BM25 sparse search

│ # Storage
├── vector_store.py           # ChromaDB wrapper
├── document_manager.py       # Unified document CRUD

│ # Ingestion
├── scraper.py                # WebScraper
├── uploader.py               # FileUploader
├── document_parser.py        # Docling document parser
├── incremental_parser.py     # Streaming parser

│ # Database
└── database/
    ├── __init__.py
    ├── models.py             # SQLAlchemy models
    ├── repository.py         # CRUD operations
    └── session.py            # Connection management
