# Retrieval & Generation

Embedding, search, reranking, and LLM generation modules.
The retrieval pipeline finds relevant documents and generates answers using a multi-stage approach.
## Retrieval Architecture
### RAGSystem (`rag.py`)

The main orchestrator, providing both sync and async APIs:
```python
from src import RAGSystem

rag = RAGSystem(vector_store_path="./data/chroma_db")

# Async query (recommended in FastAPI)
response = await rag.query_async("How do I create an invoice?", top_k=5)

# Sync query (for scripts/CLI)
response = rag.query("How do I create an invoice?", top_k=5)

# Response contains:
print(response.answer)   # LLM-generated answer
print(response.sources)  # List of source documents with scores
print(response.timing)   # Performance breakdown per step
```

Key methods:
| Method | Description |
|---|---|
| `query(q)` / `query_async(q)` | Full RAG pipeline: retrieve + rerank + generate |
| `search(q)` / `search_async(q)` | Semantic search only, no LLM |
| `add_documents(docs)` | Add new documents |
| `upsert_documents(docs)` | Smart update (skip unchanged) |
| `delete_document(id)` | Delete a single document |
| `delete_by_source(source)` | Delete by source URL |
| `delete_by_filter(where)` | Delete by metadata filter |
| `get_document_stats()` | Get collection statistics |
| `health_check()` | Verify all services |
### SemanticRetriever (`retriever.py`)

Hybrid search combining vector and keyword retrieval:
**Step 1 — Embed Query**

The query is converted to a vector via the `Embedder`.

**Step 2 — Parallel Search**

Two searches run concurrently via `asyncio.gather()`:

- **Vector Search** — ChromaDB cosine similarity (top 20 candidates)
- **BM25 Search** — TF-IDF keyword scoring (top 20 candidates)
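The concurrent step can be sketched with `asyncio.gather()` as below; the two search coroutines are stand-ins for the real retriever calls, with sleeps simulating I/O:

```python
import asyncio

# Stand-in for the ChromaDB vector search (simulated latency).
async def vector_search(query: str) -> list[str]:
    await asyncio.sleep(0.01)
    return ["doc-a", "doc-b"]

# Stand-in for the BM25 keyword search (simulated latency).
async def bm25_search(query: str) -> list[str]:
    await asyncio.sleep(0.01)
    return ["doc-b", "doc-c"]

async def parallel_search(query: str):
    # Both coroutines are awaited together, so total latency is
    # roughly max(vector, bm25) rather than their sum.
    return await asyncio.gather(vector_search(query), bm25_search(query))

vector_hits, keyword_hits = asyncio.run(parallel_search("create invoice"))
```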
**Step 3 — RRF Fusion**

Results are merged using Reciprocal Rank Fusion:

```
RRF_score(d) = Σ_i 1 / (k + rank_i(d))
```

where `k = 60` (the standard constant) and `rank_i(d)` is the document's rank in result list `i`.
**Step 4 — Reranking**

The fused candidates are scored by the cross-encoder reranker, and the top `k` are returned.
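The RRF formula from Step 3 is small enough to show in full. This is a generic sketch of the technique with `k = 60`, not the retriever's actual code:

```python
# Fuse multiple ranked result lists with Reciprocal Rank Fusion.
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            # Each appearance contributes 1 / (k + rank).
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc-a", "doc-b", "doc-c"]
keyword_hits = ["doc-b", "doc-d"]
fused = rrf_fuse([vector_hits, keyword_hits])
# doc-b appears in both lists, so it ranks first
```

A document found by both searches accumulates two reciprocal-rank contributions, which is why hybrid hits float to the top even when neither individual rank is best.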
### Embedder (`embedder.py`)

Generates vector embeddings via Ollama:
```python
from src.embedder import OllamaEmbedder

embedder = OllamaEmbedder(
    model="bge-m3:latest",
    base_url="https://your-ollama-server.example.com",
)

# Single embedding
result = embedder.embed("Hello world")
print(result.embedding)   # List of floats
print(result.dimensions)  # Vector size

# Health check
if embedder.health_check():
    print("Ollama is ready")
```

- Async HTTP via `httpx.AsyncClient`
- Batch support with semaphore-controlled concurrency
- Local fallback using `sentence-transformers` for offline use
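Semaphore-controlled concurrency can be sketched as follows; `embed_one` is a stand-in for the real async Ollama call, and `max_concurrency` is an illustrative parameter name:

```python
import asyncio

# Stand-in for one async embedding request (simulated HTTP round trip).
async def embed_one(text: str) -> list[float]:
    await asyncio.sleep(0.01)
    return [float(len(text))]  # dummy 1-d "embedding"

async def embed_batch(texts: list[str], max_concurrency: int = 4):
    # The semaphore caps how many requests are in flight at once,
    # so a large batch doesn't overwhelm the embedding server.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(text: str) -> list[float]:
        async with sem:
            return await embed_one(text)

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(t) for t in texts))

vectors = asyncio.run(embed_batch(["hello", "world", "hi"]))
```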
### Reranker (`reranker.py`)

Cross-encoder reranking via HuggingFace Text Embeddings Inference (TEI):

- Model: `BAAI/bge-reranker-v2-m3` (multilingual)
- Scores each (query, document) pair for fine-grained relevance
- Async HTTP to the TEI `/rerank` endpoint
- Can be disabled via `RERANKER_ENABLED=false`
When disabled, the retriever returns RRF-fused results without reranking.
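A sketch of the `/rerank` exchange, with no network call: TEI's rerank endpoint accepts a query plus candidate texts and returns per-candidate scores keyed by index. The helper names here are illustrative, not from `reranker.py`:

```python
# Build the JSON body TEI's /rerank endpoint expects.
def build_rerank_payload(query: str, docs: list[str]) -> dict:
    return {"query": query, "texts": docs}

# Reorder the original docs by the scores TEI returns and keep top_k.
def apply_rerank(docs: list[str], response: list[dict], top_k: int) -> list[str]:
    ranked = sorted(response, key=lambda r: r["score"], reverse=True)
    return [docs[r["index"]] for r in ranked[:top_k]]

docs = ["Invoices live in Billing.", "Unrelated changelog entry."]
payload = build_rerank_payload("How do I create an invoice?", docs)

# A TEI response has this shape (scores here are made up):
fake_response = [{"index": 0, "score": 0.92}, {"index": 1, "score": 0.03}]
top = apply_rerank(docs, fake_response, top_k=1)
```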
### LLM Client (`llm.py`)

Ollama LLM integration for answer generation:

- Model: `gemma3:latest` (configurable)
- Async HTTP to Ollama `/api/chat`
- Configurable temperature (default: 0.7)
- Configurable timeout (default: 120s)
- Health check via `/api/tags`
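A request body for Ollama's `/api/chat`, matching the defaults listed above, looks roughly like this; the helper function is a sketch, not `llm.py`'s actual API:

```python
# Assemble an Ollama /api/chat request body.
def build_chat_request(model: str, prompt: str, temperature: float = 0.7) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON response, not a stream
        "options": {"temperature": temperature},
    }

req = build_chat_request("gemma3:latest", "Answer using only the provided context.")
```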
The LLM receives a formatted prompt containing:
- The user's question
- The top-k retrieved document chunks as context
- Instructions to answer based only on the provided context
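The prompt assembly described above can be sketched like this; the exact wording and formatting used by `llm.py` are assumptions:

```python
# Combine retrieved chunks and the user's question into one prompt.
def build_prompt(question: str, chunks: list[str]) -> str:
    # Number each chunk so the model (and the reader) can tell them apart.
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "How do I create an invoice?",
    ["Invoices are created from the Billing tab."],
)
```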
### BM25 Index (`bm25.py`)

In-memory sparse keyword search:
- TF-IDF based scoring algorithm
- Index built from all documents in the ChromaDB collection
- Returns documents ranked by keyword relevance
- Complements vector search — catches exact term matches that semantic search may miss
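For intuition, here is a compact BM25 scoring sketch with the usual default parameters (`k1 = 1.5`, `b = 0.75`). The actual `bm25.py` implementation is not shown in this doc, so treat this as an illustration of the technique:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)  # document frequency
            if df == 0 or tf[term] == 0:
                continue
            # Robertson-Sparck Jones style IDF, shifted to stay positive.
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            freq = tf[term]
            # Term-frequency saturation plus document-length normalization.
            norm = freq + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["create an invoice in billing", "delete a user account"]
scores = bm25_scores("create invoice", docs)
# The document containing the exact query terms scores higher
```

Because scoring keys on exact tokens, a query like "create invoice" reliably surfaces documents using those literal terms, which is exactly the gap this index fills next to semantic search.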