
Preprocessing Pipeline

Text cleaning, chunking, deduplication, and language detection modules

The preprocessing pipeline transforms raw document text into clean, chunked, and annotated content ready for embedding and storage.

Pipeline Flow

Raw documents flow through the modules below in order: TextCleaner → ArabicTextFixer → TextChunker (or SemanticChunker) → Deduplicator → LanguageDetector.

RAGPreprocessor (pipeline.py)

The orchestrator that runs all preprocessing steps in sequence.

from src import RAGPreprocessor

preprocessor = RAGPreprocessor(
    target_chunk_words=500,
    min_chunk_words=100,
    max_chunk_words=600,
    overlap_words=50,
    dedupe_similarity_threshold=0.85,
    normalize_arabic=False,
    remove_urls=True,
)

# Process a batch of documents
chunks = preprocessor.process_documents([
    {"text": "Raw content...", "source": "web", "id": "doc-001"},
])

# Get processing statistics
stats = preprocessor.get_stats(chunks)

TextCleaner (cleaner.py)

Removes noise from raw text:

  • HTML tags — strips all HTML markup
  • JavaScript/CSS — removes <script> and <style> blocks
  • Encoding fixes — repairs mojibake via ftfy
  • Unwanted symbols — removes control characters and special symbols
  • URLs — optionally removes HTTP/HTTPS URLs
  • Whitespace — normalizes excessive whitespace and blank lines

from src.cleaner import TextCleaner

cleaner = TextCleaner(normalize_arabic=False, remove_urls=True)
clean = cleaner.clean("<p>Hello <script>alert('x')</script> World</p>")
# Result: "Hello World"

ArabicTextFixer (arabic_text_fixer.py)

Specialized processing for Arabic/RTL text:

  • NFKC normalization — converts Arabic Presentation Forms (U+FE70-U+FEFF) to standard Arabic letters (U+0600-U+06FF)
  • Format-aware logic — adapts behavior based on source document type:
    Source Type                  | Text Reversal | Column Reversal
    -----------------------------|---------------|----------------
    PDF with Presentation Forms  | No            | No
    PDF with Standard Arabic     | Yes           | Yes
    DOCX                         | Yes           | No
  • Adds metadata flags: had_presentation_forms, skip_column_reversal
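The NFKC normalization step can be reproduced with Python's standard unicodedata module. The specific code points below are illustrative examples, and the helper function is a hypothetical sketch of the check behind the had_presentation_forms flag, not the module's own code:

```python
import unicodedata

# U+FE8F is ARABIC LETTER BEH ISOLATED FORM, a Presentation Forms-B code point.
# NFKC maps it back to the standard letter U+0628 (BEH).
assert unicodedata.normalize("NFKC", "\uFE8F") == "\u0628"

# Ligatures decompose into their component letters:
# U+FEFB (LAM WITH ALEF ISOLATED FORM) becomes U+0644 U+0627 (LAM + ALEF).
assert unicodedata.normalize("NFKC", "\uFEFB") == "\u0644\u0627"

def had_presentation_forms(text: str) -> bool:
    """True if any character falls in the U+FE70-U+FEFF presentation range."""
    return any("\uFE70" <= ch <= "\uFEFF" for ch in text)

had_presentation_forms("\uFE8F")  # True: the input used presentation forms
```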

TextChunker (chunker.py)

Splits documents into smaller chunks for embedding:

  • Target size: ~500 words (configurable)
  • Min/Max bounds: 100-600 words
  • Overlap: 50 words between consecutive chunks
  • Boundary respect: splits at paragraph and sentence boundaries when possible
  • Metadata: each chunk tracks chunk_index and total_chunks

from src.chunker import TextChunker

chunker = TextChunker(
    target_words=500,
    min_words=100,
    max_words=600,
    overlap_words=50,
)
chunks = chunker.chunk(text, doc_id="doc-001")
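The overlap mechanics can be sketched as a sliding word window. This is a simplified stand-in for TextChunker that ignores the paragraph- and sentence-boundary logic:

```python
def chunk_words(words, target=500, overlap=50):
    """Yield overlapping word windows: each chunk starts target - overlap
    words after the previous one, so consecutive chunks share overlap words."""
    step = target - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + target])
        if start + target >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(1200)]
chunks = chunk_words(words, target=500, overlap=50)
# Chunks start at words 0, 450, and 900; the last 50 words of each
# chunk open the next one.
```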

SemanticChunker (semantic_chunker.py)

Heading-aware chunking for structured documents (Markdown, HTML):

  • Splits on heading boundaries (H1 through H6)
  • Preserves document hierarchy in metadata
  • Handles both LTR and RTL text directions
  • Falls back to TextChunker for unstructured content
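For Markdown input, the heading-aware split can be sketched with a regex over H1-H6 lines. This is a simplified illustration, not the SemanticChunker implementation (which also handles HTML, hierarchy metadata, and RTL text):

```python
import re

# Matches ATX headings: one to six '#' characters at the start of a line.
HEADING = re.compile(r"^(#{1,6})\s+(.*)", re.MULTILINE)

def split_on_headings(markdown: str):
    """Split a Markdown document into {heading, level, text} sections."""
    sections = []
    matches = list(HEADING.finditer(markdown))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        sections.append({
            "heading": m.group(2),
            "level": len(m.group(1)),           # number of '#' = heading depth
            "text": markdown[m.end():end].strip(),
        })
    return sections

doc = "# Intro\nHello.\n## Details\nMore text.\n"
sections = split_on_headings(doc)
# Two sections: "Intro" (level 1) and "Details" (level 2)
```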

Deduplicator (deduplicator.py)

Removes duplicate content at two levels:

  1. Exact duplicates — MD5 hash comparison
  2. Near-duplicates — Jaccard similarity with configurable threshold (default: 0.85)

from src.deduplicator import Deduplicator

dedup = Deduplicator(similarity_threshold=0.85)
unique_chunks = dedup.deduplicate(chunks)
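Both levels can be sketched in a few lines: MD5 over the raw text catches exact copies, and Jaccard similarity over word sets catches near-duplicates. This is a simplified stand-in for Deduplicator, not its actual implementation:

```python
import hashlib

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the two texts' word sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def deduplicate(chunks, threshold=0.85):
    """Drop exact duplicates (MD5) and near-duplicates (Jaccard >= threshold)."""
    seen_hashes, kept = set(), []
    for text in chunks:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:                              # level 1: exact
            continue
        if any(jaccard(text, k) >= threshold for k in kept):   # level 2: near
            continue
        seen_hashes.add(digest)
        kept.append(text)
    return kept

a = ("alpha bravo charlie delta echo foxtrot golf hotel india juliet "
     "kilo lima mike november oscar papa quebec romeo sierra tango")
b = a.replace("tango", "uniform")  # differs by one word: Jaccard = 19/21 ~ 0.90
unique = deduplicate([a, a, b, "completely different text"])
# The exact copy of a and the near-duplicate b are both dropped.
```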

LanguageDetector (language_detector.py)

Detects the language of each chunk using the langdetect library:

  • Supports 50+ languages
  • Adds language field to chunk metadata (e.g., "en", "ar")
  • Identifies RTL languages: Arabic, Hebrew, Farsi, Urdu
  • Used for language-aware processing downstream
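The RTL check over detected language codes can be sketched as a simple set lookup on ISO 639-1 codes. The RTL_LANGUAGES name and is_rtl flag below are illustrative assumptions, not the LanguageDetector API:

```python
# ISO 639-1 codes for the RTL languages listed above (assumed set name).
RTL_LANGUAGES = {"ar", "he", "fa", "ur"}  # Arabic, Hebrew, Farsi, Urdu

def annotate_direction(chunk: dict) -> dict:
    """Add a hypothetical is_rtl flag based on the detected language code."""
    chunk["is_rtl"] = chunk.get("language") in RTL_LANGUAGES
    return chunk

chunk = {"text": "مرحبا بالعالم", "language": "ar"}
annotate_direction(chunk)
# chunk["is_rtl"] is now True
```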
