
Preprocessing Pipeline

Text cleaning, chunking, deduplication, and language detection modules

The preprocessing pipeline transforms raw document text into clean, chunked, and annotated content ready for embedding and storage.

Pipeline Flow

Raw documents flow through the modules below in order: TextCleaner → ArabicTextFixer → TextChunker (or SemanticChunker) → Deduplicator → LanguageDetector.

RAGPreprocessor (pipeline.py)

The orchestrator that runs all preprocessing steps in sequence.

from src import RAGPreprocessor

preprocessor = RAGPreprocessor(
    target_chunk_words=500,
    min_chunk_words=100,
    max_chunk_words=600,
    overlap_words=50,
    dedupe_similarity_threshold=0.85,
    normalize_arabic=False,
    remove_urls=True,
)

# Process a batch of documents
chunks = preprocessor.process_documents([
    {"text": "Raw content...", "source": "web", "id": "doc-001"},
])

# Get processing statistics
stats = preprocessor.get_stats(chunks)

TextCleaner (cleaner.py)

Removes noise from raw text:

  • HTML tags — strips all HTML markup
  • JavaScript/CSS — removes <script> and <style> blocks
  • Encoding fixes — repairs mojibake via ftfy
  • Unwanted symbols — removes control characters and special symbols
  • URLs — optionally removes HTTP/HTTPS URLs
  • Whitespace — normalizes excessive whitespace and blank lines

from src.cleaner import TextCleaner

cleaner = TextCleaner(normalize_arabic=False, remove_urls=True)
clean = cleaner.clean("<p>Hello <script>alert('x')</script> World</p>")
# Result: "Hello World"

ArabicTextFixer (arabic_text_fixer.py)

Specialized processing for Arabic/RTL text:

  • NFKC normalization — converts Arabic Presentation Forms (U+FE70-U+FEFF) to standard Arabic letters (U+0600-U+06FF)
  • Format-aware logic — adapts behavior based on source document type:
    Source Type                  | Text Reversal | Column Reversal
    -----------------------------|---------------|----------------
    PDF with Presentation Forms  | No            | No
    PDF with Standard Arabic     | Yes           | Yes
    DOCX                         | Yes           | No
  • Adds metadata flags: had_presentation_forms, skip_column_reversal
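The NFKC normalization step can be reproduced with Python's standard unicodedata module. The specific code points below are illustrative examples, and the helper function is a hypothetical sketch of the check behind the had_presentation_forms flag, not the module's own code:

```python
import unicodedata

# U+FE8F is ARABIC LETTER BEH ISOLATED FORM, a Presentation Forms-B code point.
# NFKC maps it back to the standard letter U+0628 (BEH).
assert unicodedata.normalize("NFKC", "\uFE8F") == "\u0628"

# Ligatures decompose into their component letters:
# U+FEFB (LAM WITH ALEF ISOLATED FORM) becomes U+0644 U+0627 (LAM + ALEF).
assert unicodedata.normalize("NFKC", "\uFEFB") == "\u0644\u0627"

def had_presentation_forms(text: str) -> bool:
    """True if any character falls in the U+FE70-U+FEFF presentation range."""
    return any("\uFE70" <= ch <= "\uFEFF" for ch in text)

had_presentation_forms("\uFE8F")  # True: the input used presentation forms
```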

TextChunker (chunker.py)

Splits documents into smaller chunks for embedding:

  • Target size: ~500 words (configurable)
  • Min/Max bounds: 100-600 words
  • Overlap: 50 words between consecutive chunks
  • Boundary respect: splits at paragraph and sentence boundaries when possible
  • Metadata: each chunk tracks chunk_index and total_chunks

from src.chunker import TextChunker

chunker = TextChunker(
    target_words=500,
    min_words=100,
    max_words=600,
    overlap_words=50,
)
chunks = chunker.chunk(text, doc_id="doc-001")
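The overlap mechanics can be sketched as a sliding word window. This is a simplified stand-in for TextChunker that ignores the paragraph- and sentence-boundary logic:

```python
def chunk_words(words, target=500, overlap=50):
    """Yield overlapping word windows: each chunk starts target - overlap
    words after the previous one, so consecutive chunks share overlap words."""
    step = target - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + target])
        if start + target >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(1200)]
chunks = chunk_words(words, target=500, overlap=50)
# Chunks start at words 0, 450, and 900; the last 50 words of each
# chunk open the next one.
```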

SemanticChunker (semantic_chunker.py)

Heading-aware chunking for structured documents (Markdown, HTML):

  • Splits on heading boundaries (H1 through H6)
  • Preserves document hierarchy in metadata
  • Handles both LTR and RTL text directions
  • Falls back to TextChunker for unstructured content
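For Markdown input, the heading-aware split can be sketched with a regex over H1-H6 lines. This is a simplified illustration, not the SemanticChunker implementation (which also handles HTML, hierarchy metadata, and RTL text):

```python
import re

# Matches ATX headings: one to six '#' characters at the start of a line.
HEADING = re.compile(r"^(#{1,6})\s+(.*)", re.MULTILINE)

def split_on_headings(markdown: str):
    """Split a Markdown document into {heading, level, text} sections."""
    sections = []
    matches = list(HEADING.finditer(markdown))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        sections.append({
            "heading": m.group(2),
            "level": len(m.group(1)),           # number of '#' = heading depth
            "text": markdown[m.end():end].strip(),
        })
    return sections

doc = "# Intro\nHello.\n## Details\nMore text.\n"
sections = split_on_headings(doc)
# Two sections: "Intro" (level 1) and "Details" (level 2)
```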

Deduplicator (deduplicator.py)

Removes duplicate content at two levels:

  1. Exact duplicates — MD5 hash comparison
  2. Near-duplicates — Jaccard similarity with configurable threshold (default: 0.85)

from src.deduplicator import Deduplicator

dedup = Deduplicator(similarity_threshold=0.85)
unique_chunks = dedup.deduplicate(chunks)
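Both levels can be sketched in a few lines: MD5 over the raw text catches exact copies, and Jaccard similarity over word sets catches near-duplicates. This is a simplified stand-in for Deduplicator, not its actual implementation:

```python
import hashlib

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the two texts' word sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def deduplicate(chunks, threshold=0.85):
    """Drop exact duplicates (MD5) and near-duplicates (Jaccard >= threshold)."""
    seen_hashes, kept = set(), []
    for text in chunks:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:                              # level 1: exact
            continue
        if any(jaccard(text, k) >= threshold for k in kept):   # level 2: near
            continue
        seen_hashes.add(digest)
        kept.append(text)
    return kept

a = ("alpha bravo charlie delta echo foxtrot golf hotel india juliet "
     "kilo lima mike november oscar papa quebec romeo sierra tango")
b = a.replace("tango", "uniform")  # differs by one word: Jaccard = 19/21 ~ 0.90
unique = deduplicate([a, a, b, "completely different text"])
# The exact copy of a and the near-duplicate b are both dropped.
```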

LanguageDetector (language_detector.py)

Detects the language of each chunk using the langdetect library:

  • Supports 50+ languages
  • Adds language field to chunk metadata (e.g., "en", "ar")
  • Identifies RTL languages: Arabic, Hebrew, Farsi, Urdu
  • Used for language-aware processing downstream
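The RTL check over detected language codes can be sketched as a simple set lookup on ISO 639-1 codes. The RTL_LANGUAGES name and is_rtl flag below are illustrative assumptions, not the LanguageDetector API:

```python
# ISO 639-1 codes for the RTL languages listed above (assumed set name).
RTL_LANGUAGES = {"ar", "he", "fa", "ur"}  # Arabic, Hebrew, Farsi, Urdu

def annotate_direction(chunk: dict) -> dict:
    """Add a hypothetical is_rtl flag based on the detected language code."""
    chunk["is_rtl"] = chunk.get("language") in RTL_LANGUAGES
    return chunk

chunk = {"text": "مرحبا بالعالم", "language": "ar"}
annotate_direction(chunk)
# chunk["is_rtl"] is now True
```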
