# Data Flow
This page illustrates how data flows through the RAG system during the two primary operations: document ingestion and query processing.
## Document Ingestion Flow
When a document is ingested (via URL scraping or file upload), it passes through the following pipeline:
### Step-by-Step Breakdown
| Step | Component | What Happens |
|---|---|---|
| 1. Input | WebScraper / FileUploader | Content is fetched from a URL or received as a file upload |
| 2. Parse | DocumentParser | Docling extracts text, tables, and images from the document |
| 3. Dedup Check | Deduplicator | MD5 hash of content checked against existing documents |
| 4. Clean | TextCleaner | HTML/JS/CSS removed, encoding fixed, whitespace normalized |
| 5. Arabic Fix | ArabicTextFixer | NFKC normalization for Arabic presentation forms |
| 6. Chunk | TextChunker | Text split into ~500-word chunks with 50-word overlap |
| 7. Deduplicate | Deduplicator | Near-duplicate chunks removed (Jaccard similarity) |
| 8. Language | LanguageDetector | Language detected and added to chunk metadata |
| 9. Embed | Embedder | Vector embedding generated via Ollama bge-m3 |
| 10. Store | DocumentManager | Stored in ChromaDB (vectors) and PostgreSQL (metadata) |
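The hashing, chunking, and near-duplicate steps (3, 6, and 7) can be sketched roughly as follows. This is a minimal standalone version, not the actual `TextChunker`/`Deduplicator` code: the function names and the 0.85 Jaccard threshold are illustrative assumptions.

```python
import hashlib


def content_hash(text: str) -> str:
    """MD5 hash used for exact-duplicate detection at the document level (step 3)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into ~`size`-word chunks with `overlap` words shared
    between consecutive chunks (step 6)."""
    words = text.split()
    chunks: list[str] = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last chunk reached the end
            break
    return chunks


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity used for near-duplicate detection."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def dedup_chunks(chunks: list[str], threshold: float = 0.85) -> list[str]:
    """Drop chunks too similar to an already-kept chunk (step 7).
    The 0.85 threshold is an assumed value for illustration."""
    kept: list[str] = []
    for chunk in chunks:
        if all(jaccard(chunk, k) < threshold for k in kept):
            kept.append(chunk)
    return kept
```

With `size=500` and `overlap=50` the window advances 450 words per chunk, so a 1,200-word document yields three chunks, the last one shorter.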
## Query Processing Flow
When a user submits a query, the system retrieves relevant context and generates an answer:
### Step-by-Step Breakdown
| Step | Component | What Happens | Typical Time |
|---|---|---|---|
| 1. Embed | Embedder | Query text → vector embedding via Ollama | ~50ms |
| 2. Vector Search | ChromaDB | Cosine similarity search, top 20 candidates | ~120ms |
| 3. BM25 Search | BM25 Index | TF-IDF keyword scoring, top 20 candidates | ~10ms |
| 4. RRF Fusion | SemanticRetriever | Merge vector + BM25 results with RRF | ~1ms |
| 5. Rerank | TEI Reranker | Cross-encoder scores all candidates | ~2.5s |
| 6. Format | RAGSystem | Top 5 documents formatted as LLM context | ~1ms |
| 7. Generate | Ollama LLM | Answer generated from context + question | ~10s |
| **Total** | | | **~12.7s** |
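Step 4's Reciprocal Rank Fusion can be sketched as a short standalone function. Each candidate scores `1 / (k + rank)` summed over every ranked list it appears in, so documents ranked well by both vector and BM25 search rise to the top. The `k=60` constant is the conventional RRF default; the actual `SemanticRetriever` value is an assumption here.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge multiple ranked candidate lists with Reciprocal Rank Fusion.

    Score(doc) = sum over lists of 1 / (k + rank), rank starting at 1.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

For example, fusing vector results `["a", "b", "c"]` with BM25 results `["b", "c", "d"]` ranks `b` first, since it appears near the top of both lists.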
## Query Time Distribution
## Async Execution Model
The query pipeline uses async execution to overlap I/O-bound and CPU-bound work:

Key performance optimizations:

- **Parallel embedding + BM25**: `asyncio.gather()` runs the query embedding and the BM25 search concurrently
- **Thread pool offloading**: CPU-bound ChromaDB and BM25 operations run in a thread pool
- **Async HTTP**: all external calls (Ollama, TEI) use `httpx.AsyncClient`
- **Batch embedding**: semaphore-controlled concurrency for batch operations
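The concurrency pattern above can be sketched as follows. The three inner functions are stand-ins for the real Ollama, BM25, and ChromaDB calls (their names and signatures are illustrative, not the system's actual API):

```python
import asyncio


async def embed(query: str) -> list[float]:
    # Stand-in for the async HTTP call to Ollama (httpx.AsyncClient in the real system).
    await asyncio.sleep(0.05)
    return [0.0] * 4


def bm25_search(query: str, top_k: int = 20) -> list[str]:
    # Stand-in for CPU-bound BM25 keyword scoring.
    return [f"doc{i}" for i in range(top_k)]


def vector_search(query_vec: list[float], top_k: int = 20) -> list[str]:
    # Stand-in for the CPU-bound ChromaDB cosine-similarity query.
    return [f"doc{i}" for i in range(top_k)]


async def retrieve(query: str) -> tuple[list[str], list[str]]:
    # Embedding (I/O-bound) and BM25 (CPU-bound, offloaded to the default
    # thread pool) run concurrently via asyncio.gather().
    query_vec, bm25_hits = await asyncio.gather(
        embed(query),
        asyncio.to_thread(bm25_search, query),
    )
    # Vector search depends on the embedding, so it runs afterwards,
    # also offloaded so it does not block the event loop.
    vector_hits = await asyncio.to_thread(vector_search, query_vec)
    return vector_hits, bm25_hits
```

Because BM25 runs while the embedding request is in flight, its ~10ms cost is hidden inside the ~50ms embedding latency rather than added to it.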