Core Modules
Preprocessing Pipeline
Text cleaning, chunking, deduplication, and language detection modules
The preprocessing pipeline transforms raw document text into clean, chunked, and annotated content ready for embedding and storage.
Pipeline Flow
RAGPreprocessor (pipeline.py)
The orchestrator that runs all preprocessing steps in sequence.
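The sequential orchestration can be sketched as a toy pipeline (the step functions below are stand-ins for illustration, not the module's real internals):

```python
class MiniPipeline:
    """Toy orchestrator: each step takes and returns a list of chunk dicts."""

    def __init__(self, steps):
        self.steps = steps

    def process_documents(self, docs):
        # Run every preprocessing step in order, feeding each one's output forward
        for step in self.steps:
            docs = step(docs)
        return docs

# Stand-in steps illustrating the order: clean -> chunk -> dedupe -> detect language
pipeline = MiniPipeline([
    lambda docs: [{**d, "text": d["text"].strip()} for d in docs],  # cleaning
    lambda docs: docs,  # chunking would expand each doc into several chunks
    lambda docs: docs,  # deduplication would drop repeated chunks
    lambda docs: [{**d, "language": "en"} for d in docs],  # language detection
])
```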
```python
from src import RAGPreprocessor

preprocessor = RAGPreprocessor(
    target_chunk_words=500,
    min_chunk_words=100,
    max_chunk_words=600,
    overlap_words=50,
    dedupe_similarity_threshold=0.85,
    normalize_arabic=False,
    remove_urls=True,
)

# Process a batch of documents
chunks = preprocessor.process_documents([
    {"text": "Raw content...", "source": "web", "id": "doc-001"},
])

# Get processing statistics
stats = preprocessor.get_stats(chunks)
```

TextCleaner (cleaner.py)
Removes noise from raw text:
- HTML tags — strips all HTML markup
- JavaScript/CSS — removes `<script>` and `<style>` blocks
- Encoding fixes — repairs mojibake via `ftfy`
- Unwanted symbols — removes control characters and special symbols
- URLs — optionally removes HTTP/HTTPS URLs
- Whitespace — normalizes excessive whitespace and blank lines
```python
from src.cleaner import TextCleaner

cleaner = TextCleaner(normalize_arabic=False, remove_urls=True)
clean = cleaner.clean("<p>Hello <script>alert('x')</script> World</p>")
# Result: "Hello World"
```

ArabicTextFixer (arabic_text_fixer.py)
Specialized processing for Arabic/RTL text:
- NFKC normalization — converts Arabic Presentation Forms (U+FE70-U+FEFF) to standard Arabic letters (U+0600-U+06FF)
- Format-aware logic — adapts behavior based on source document type:
| Source Type | Text Reversal | Column Reversal |
|---|---|---|
| PDF with Presentation Forms | No | No |
| PDF with Standard Arabic | Yes | Yes |
| DOCX | Yes | No |
- Adds metadata flags: `had_presentation_forms`, `skip_column_reversal`
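Python's standard `unicodedata` module can illustrate the NFKC step; the detection range and the return shape here are a sketch, not the module's actual API:

```python
import unicodedata

def fix_presentation_forms(text: str) -> tuple[str, bool]:
    # Flag characters in the Arabic Presentation Forms-B block (U+FE70-U+FEFF)
    had_forms = any(0xFE70 <= ord(c) <= 0xFEFF for c in text)
    # NFKC folds each positional glyph back to its standard letter (U+0600-U+06FF)
    return unicodedata.normalize("NFKC", text), had_forms

# U+FE8D (ALEF ISOLATED FORM) and U+FE8F (BEH ISOLATED FORM)
# normalize to U+0627 (ALEF) and U+0628 (BEH)
fixed, had_forms = fix_presentation_forms("\uFE8D\uFE8F")
```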
TextChunker (chunker.py)
Splits documents into smaller chunks for embedding:
- Target size: ~500 words (configurable)
- Min/Max bounds: 100-600 words
- Overlap: 50 words between consecutive chunks
- Boundary respect: splits at paragraph and sentence boundaries when possible
- Metadata: each chunk tracks `chunk_index` and `total_chunks`
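The word-window logic can be sketched as follows (a simplification that ignores the paragraph/sentence-boundary handling; `chunk_words` is a hypothetical helper, not the module's API):

```python
def chunk_words(words: list[str], target: int = 500, overlap: int = 50) -> list[list[str]]:
    # Slide a window of `target` words; each new window re-reads the last `overlap` words
    chunks, start = [], 0
    while start < len(words):
        end = min(start + target, len(words))
        chunks.append(words[start:end])
        if end == len(words):
            break
        start = end - overlap
    return chunks

parts = chunk_words([f"w{i}" for i in range(1200)])
```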
```python
from src.chunker import TextChunker

chunker = TextChunker(
    target_words=500,
    min_words=100,
    max_words=600,
    overlap_words=50,
)
chunks = chunker.chunk(text, doc_id="doc-001")
```

SemanticChunker (semantic_chunker.py)
Heading-aware chunking for structured documents (Markdown, HTML):
- Splits on heading boundaries (H1 through H6)
- Preserves document hierarchy in metadata
- Handles both LTR and RTL text directions
- Falls back to `TextChunker` for unstructured content
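For Markdown input, the heading-boundary split might look like this (a sketch under assumed behavior; the real module's interface may differ):

```python
import re

HEADING = re.compile(r"^#{1,6}\s+.+$", re.MULTILINE)

def split_on_headings(markdown: str) -> list[str]:
    # Split the document at each H1-H6 heading, keeping the heading with its section
    positions = [m.start() for m in HEADING.finditer(markdown)]
    if not positions:
        return [markdown]          # no structure: caller falls back to TextChunker
    if positions[0] != 0:
        positions.insert(0, 0)     # keep any preamble before the first heading
    positions.append(len(markdown))
    return [markdown[a:b].strip() for a, b in zip(positions, positions[1:])]

sections = split_on_headings("# A\nintro\n## B\ndetails")
```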
Deduplicator (deduplicator.py)
Removes duplicate content at two levels:
- Exact duplicates — MD5 hash comparison
- Near-duplicates — Jaccard similarity with configurable threshold (default: 0.85)
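The two levels can be sketched with stdlib tools (word-set Jaccard here; the real module may tokenize differently):

```python
import hashlib

def jaccard(a: str, b: str) -> float:
    # Word-level Jaccard similarity between two texts
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def deduplicate(texts: list[str], threshold: float = 0.85) -> list[str]:
    seen_hashes, kept = set(), []
    for t in texts:
        h = hashlib.md5(t.encode("utf-8")).hexdigest()
        if h in seen_hashes:
            continue  # exact duplicate
        if any(jaccard(t, k) >= threshold for k in kept):
            continue  # near-duplicate
        seen_hashes.add(h)
        kept.append(t)
    return kept
```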
```python
from src.deduplicator import Deduplicator

dedup = Deduplicator(similarity_threshold=0.85)
unique_chunks = dedup.deduplicate(chunks)
```

LanguageDetector (language_detector.py)
Detects the language of each chunk using the langdetect library:
- Supports 50+ languages
- Adds `language` field to chunk metadata (e.g., `"en"`, `"ar"`)
- Identifies RTL languages: Arabic, Hebrew, Farsi, Urdu
- Used for language-aware processing downstream
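The RTL classification step can be sketched as a simple set lookup (`annotate_chunk` and the `is_rtl` key are assumptions for illustration, not the module's actual API):

```python
RTL_LANGS = {"ar", "he", "fa", "ur"}  # Arabic, Hebrew, Farsi, Urdu

def annotate_chunk(chunk: dict, language: str) -> dict:
    # Attach the detected language code and an RTL flag to chunk metadata
    chunk["language"] = language
    chunk["is_rtl"] = language in RTL_LANGS
    return chunk

chunk = annotate_chunk({"text": "..."}, "ar")
```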