# Retrieval & Generation

Embedding, search, reranking, and LLM generation modules.
The retrieval pipeline finds relevant documents and generates answers using a multi-stage approach.
## Retrieval Architecture
### RAGSystem (`rag.py`)

The main orchestrator, providing both sync and async APIs:
```python
from src import RAGSystem

rag = RAGSystem(vector_store_path="./data/chroma_db")

# Async query (recommended in FastAPI)
response = await rag.query_async("How do I create an invoice?", top_k=5)

# Sync query (for scripts/CLI)
response = rag.query("How do I create an invoice?", top_k=5)

# Response contains:
print(response.answer)   # LLM-generated answer
print(response.sources)  # List of source documents with scores
print(response.timing)   # Performance breakdown per step
```

Key methods:
| Method | Description |
|---|---|
| `query(q)` / `query_async(q)` | Full RAG pipeline: retrieve + rerank + generate |
| `search(q)` / `search_async(q)` | Semantic search only, no LLM |
| `add_documents(docs)` | Add new documents |
| `upsert_documents(docs)` | Smart update (skip unchanged) |
| `delete_document(id)` | Delete a single document |
| `delete_by_source(source)` | Delete by source URL |
| `delete_by_filter(where)` | Delete by metadata filter |
| `get_document_stats()` | Get collection statistics |
| `health_check()` | Verify all services |
### SemanticRetriever (`retriever.py`)

Hybrid search combining vector and keyword retrieval:
**Step 1 — Embed Query**

The query is converted to a vector via the `Embedder`.

**Step 2 — Parallel Search**

Two searches run concurrently via `asyncio.gather()`:

- **Vector Search** — ChromaDB cosine similarity (top 20 candidates)
- **BM25 Search** — TF-IDF keyword scoring (top 20 candidates)
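The concurrent step can be sketched with `asyncio.gather()` as below; the two search coroutines are stand-ins for the real retriever calls, with sleeps simulating I/O:

```python
import asyncio

# Stand-in for the ChromaDB vector search (simulated latency).
async def vector_search(query: str) -> list[str]:
    await asyncio.sleep(0.01)
    return ["doc-a", "doc-b"]

# Stand-in for the BM25 keyword search (simulated latency).
async def bm25_search(query: str) -> list[str]:
    await asyncio.sleep(0.01)
    return ["doc-b", "doc-c"]

async def parallel_search(query: str):
    # Both coroutines are awaited together, so total latency is
    # roughly max(vector, bm25) rather than their sum.
    return await asyncio.gather(vector_search(query), bm25_search(query))

vector_hits, keyword_hits = asyncio.run(parallel_search("create invoice"))
```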
**Step 3 — RRF Fusion**

Results are merged using Reciprocal Rank Fusion:

```
RRF_score(d) = Σ_i 1 / (k + rank_i(d))
```

where `k = 60` (the standard constant) and `rank_i(d)` is the document's rank in result list `i`.
**Step 4 — Reranking**

The fused candidates are scored by the cross-encoder reranker, and the top `k` are returned.
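The RRF formula from Step 3 is small enough to show in full. This is a generic sketch of the technique with `k = 60`, not the retriever's actual code:

```python
# Fuse multiple ranked result lists with Reciprocal Rank Fusion.
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            # Each appearance contributes 1 / (k + rank).
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc-a", "doc-b", "doc-c"]
keyword_hits = ["doc-b", "doc-d"]
fused = rrf_fuse([vector_hits, keyword_hits])
# doc-b appears in both lists, so it ranks first
```

A document found by both searches accumulates two reciprocal-rank contributions, which is why hybrid hits float to the top even when neither individual rank is best.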
### Embedder (`embedder.py`)

Generates vector embeddings via Ollama:
```python
from src.embedder import OllamaEmbedder

embedder = OllamaEmbedder(
    model="bge-m3:latest",
    base_url="https://your-ollama-server.example.com",
)

# Single embedding
result = embedder.embed("Hello world")
print(result.embedding)   # List of floats
print(result.dimensions)  # Vector size

# Health check
if embedder.health_check():
    print("Ollama is ready")
```

- Async HTTP via `httpx.AsyncClient`
- Batch support with semaphore-controlled concurrency
- Local fallback using `sentence-transformers` for offline use
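Semaphore-controlled concurrency can be sketched as follows; `embed_one` is a stand-in for the real async Ollama call, and `max_concurrency` is an illustrative parameter name:

```python
import asyncio

# Stand-in for one async embedding request (simulated HTTP round trip).
async def embed_one(text: str) -> list[float]:
    await asyncio.sleep(0.01)
    return [float(len(text))]  # dummy 1-d "embedding"

async def embed_batch(texts: list[str], max_concurrency: int = 4):
    # The semaphore caps how many requests are in flight at once,
    # so a large batch doesn't overwhelm the embedding server.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(text: str) -> list[float]:
        async with sem:
            return await embed_one(text)

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(bounded(t) for t in texts))

vectors = asyncio.run(embed_batch(["hello", "world", "hi"]))
```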
### Reranker (`reranker.py`)

Cross-encoder reranking via HuggingFace Text Embeddings Inference (TEI):

- Model: `BAAI/bge-reranker-v2-m3` (multilingual)
- Scores each (query, document) pair for fine-grained relevance
- Async HTTP to the TEI `/rerank` endpoint
- Can be disabled via `RERANKER_ENABLED=false`
When disabled, the retriever returns RRF-fused results without reranking.
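A sketch of the `/rerank` exchange, with no network call: TEI's rerank endpoint accepts a query plus candidate texts and returns per-candidate scores keyed by index. The helper names here are illustrative, not from `reranker.py`:

```python
# Build the JSON body TEI's /rerank endpoint expects.
def build_rerank_payload(query: str, docs: list[str]) -> dict:
    return {"query": query, "texts": docs}

# Reorder the original docs by the scores TEI returns and keep top_k.
def apply_rerank(docs: list[str], response: list[dict], top_k: int) -> list[str]:
    ranked = sorted(response, key=lambda r: r["score"], reverse=True)
    return [docs[r["index"]] for r in ranked[:top_k]]

docs = ["Invoices live in Billing.", "Unrelated changelog entry."]
payload = build_rerank_payload("How do I create an invoice?", docs)

# A TEI response has this shape (scores here are made up):
fake_response = [{"index": 0, "score": 0.92}, {"index": 1, "score": 0.03}]
top = apply_rerank(docs, fake_response, top_k=1)
```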
### LLM Client (`llm.py`)

Ollama LLM integration for answer generation:

- Model: `gemma3:latest` (configurable)
- Async HTTP to Ollama `/api/chat`
- Configurable temperature (default: 0.7)
- Configurable timeout (default: 120s)
- Health check via `/api/tags`
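A request body for Ollama's `/api/chat`, matching the defaults listed above, looks roughly like this; the helper function is a sketch, not `llm.py`'s actual API:

```python
# Assemble an Ollama /api/chat request body.
def build_chat_request(model: str, prompt: str, temperature: float = 0.7) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON response, not a stream
        "options": {"temperature": temperature},
    }

req = build_chat_request("gemma3:latest", "Answer using only the provided context.")
```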
The LLM receives a formatted prompt containing:
- The user's question
- The top-k retrieved document chunks as context
- Instructions to answer based only on the provided context
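The prompt assembly described above can be sketched like this; the exact wording and formatting used by `llm.py` are assumptions:

```python
# Combine retrieved chunks and the user's question into one prompt.
def build_prompt(question: str, chunks: list[str]) -> str:
    # Number each chunk so the model (and the reader) can tell them apart.
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "How do I create an invoice?",
    ["Invoices are created from the Billing tab."],
)
```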
### BM25 Index (`bm25.py`)

In-memory sparse keyword search:
- TF-IDF based scoring algorithm
- Index built from all documents in the ChromaDB collection
- Returns documents ranked by keyword relevance
- Complements vector search — catches exact term matches that semantic search may miss
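For intuition, here is a compact BM25 scoring sketch with the usual default parameters (`k1 = 1.5`, `b = 0.75`). The actual `bm25.py` implementation is not shown in this doc, so treat this as an illustration of the technique:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)  # document frequency
            if df == 0 or tf[term] == 0:
                continue
            # Robertson-Sparck Jones style IDF, shifted to stay positive.
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            freq = tf[term]
            # Term-frequency saturation plus document-length normalization.
            norm = freq + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["create an invoice in billing", "delete a user account"]
scores = bm25_scores("create invoice", docs)
# The document containing the exact query terms scores higher
```

Because scoring keys on exact tokens, a query like "create invoice" reliably surfaces documents using those literal terms, which is exactly the gap this index fills next to semantic search.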