Retrieval & Generation

Embedding, search, reranking, and LLM generation modules

The retrieval pipeline finds relevant documents and generates answers using a multi-stage approach.

Retrieval Architecture

RAGSystem (rag.py)

The main orchestrator providing both sync and async APIs:

from src import RAGSystem

rag = RAGSystem(vector_store_path="./data/chroma_db")

# Async query (recommended in FastAPI)
response = await rag.query_async("How do I create an invoice?", top_k=5)

# Sync query (for scripts/CLI)
response = rag.query("How do I create an invoice?", top_k=5)

# Response contains:
print(response.answer)     # LLM-generated answer
print(response.sources)    # List of source documents with scores
print(response.timing)     # Performance breakdown per step

Key methods:

| Method | Description |
| --- | --- |
| query(q) / query_async(q) | Full RAG pipeline: retrieve + rerank + generate |
| search(q) / search_async(q) | Semantic search only, no LLM |
| add_documents(docs) | Add new documents |
| upsert_documents(docs) | Smart update (skips unchanged documents) |
| delete_document(id) | Delete a single document |
| delete_by_source(source) | Delete by source URL |
| delete_by_filter(where) | Delete by metadata filter |
| get_document_stats() | Get collection statistics |
| health_check() | Verify all backing services are reachable |

SemanticRetriever (retriever.py)

Hybrid search combining vector and keyword retrieval:

Step 1 — Embed Query

The query is converted to a vector via the Embedder.

Step 2 — Hybrid Search

Two searches run concurrently via asyncio.gather():

  • Vector Search — ChromaDB cosine similarity (top 20 candidates)
  • BM25 Search — TF-IDF keyword scoring (top 20 candidates)
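
The concurrent fan-out can be sketched with asyncio.gather(). The vector_search and bm25_search functions below are hypothetical stand-ins for the ChromaDB and BM25 calls, not the module's actual API:

```python
import asyncio

# Hypothetical stand-ins for the two search backends; the real retriever
# queries ChromaDB and the in-memory BM25 index.
async def vector_search(query: str, n: int = 20) -> list[str]:
    await asyncio.sleep(0)  # stands in for the ChromaDB round trip
    return [f"vec-{i}" for i in range(n)]

async def bm25_search(query: str, n: int = 20) -> list[str]:
    await asyncio.sleep(0)  # stands in for BM25 scoring
    return [f"kw-{i}" for i in range(n)]

async def hybrid_search(query: str) -> list[list[str]]:
    # Both searches are awaited together, so total latency is roughly
    # max(vector, bm25) rather than their sum.
    return await asyncio.gather(vector_search(query), bm25_search(query))

vec_hits, kw_hits = asyncio.run(hybrid_search("create an invoice"))
```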

Step 3 — RRF Fusion

Results are merged using Reciprocal Rank Fusion:

RRF_score(d) = Σ 1 / (k + rank_i(d))

where k = 60 (a standard constant) and rank_i(d) is the document's 1-based rank in result list i. Documents ranked highly by both searches accumulate the largest fused scores.
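
The formula translates directly into a few lines of Python. This is an illustrative sketch (the function name rrf_fuse is not the module's actual API):

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # RRF_score(d) = sum 1/(k + rank_i)
    return sorted(scores, key=scores.get, reverse=True)

# "b" appears near the top of both lists, so it wins the fusion
fused = rrf_fuse([["a", "b", "c"], ["b", "c", "d"]])
```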

Step 4 — Reranking

The fused candidates are scored by the cross-encoder reranker, and the top-k results are returned.

Embedder (embedder.py)

Generates vector embeddings via Ollama:

from src.embedder import OllamaEmbedder

embedder = OllamaEmbedder(
    model="bge-m3:latest",
    base_url="https://your-ollama-server.example.com",
)

# Single embedding
result = embedder.embed("Hello world")
print(result.embedding)    # List of floats
print(result.dimensions)   # Vector size

# Health check
if embedder.health_check():
    print("Ollama is ready")

Features:

  • Async HTTP via httpx.AsyncClient
  • Batch support with semaphore-controlled concurrency
  • Local fallback using sentence-transformers for offline use
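
Semaphore-controlled concurrency can be sketched as below; embed_one is a hypothetical stand-in for the per-text Ollama HTTP call:

```python
import asyncio

async def embed_one(text: str) -> list[float]:
    await asyncio.sleep(0)  # stands in for the httpx request to Ollama
    return [float(len(text))]  # dummy one-dimensional "embedding"

async def embed_batch(texts: list[str], max_concurrency: int = 4) -> list[list[float]]:
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(text: str) -> list[float]:
        async with sem:  # at most max_concurrency requests in flight
            return await embed_one(text)

    # Results come back in input order, regardless of completion order
    return await asyncio.gather(*(bounded(t) for t in texts))

vectors = asyncio.run(embed_batch(["alpha", "beta", "gamma"]))
```

The semaphore caps in-flight requests so a large batch cannot overwhelm the embedding server, while still allowing several requests to overlap.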

Reranker (reranker.py)

Cross-encoder reranking via HuggingFace Text Embeddings Inference (TEI):

  • Model: BAAI/bge-reranker-v2-m3 (multilingual)
  • Scores each (query, document) pair for fine-grained relevance
  • Async HTTP to TEI /rerank endpoint
  • Can be disabled via RERANKER_ENABLED=false

When disabled, the retriever returns RRF-fused results without reranking.
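
A TEI /rerank response pairs each candidate's index with a relevance score; applying it to the fused candidates might look like the sketch below (apply_rerank is illustrative, and the scores are made up):

```python
def apply_rerank(candidates: list[str], rerank_scores: list[dict], top_k: int) -> list[str]:
    # Sort by cross-encoder score, highest relevance first,
    # then map each scored index back to its document.
    ordered = sorted(rerank_scores, key=lambda r: r["score"], reverse=True)
    return [candidates[r["index"]] for r in ordered[:top_k]]

docs = ["invoice howto", "payment terms", "login help"]
scores = [
    {"index": 0, "score": 0.92},
    {"index": 1, "score": 0.41},
    {"index": 2, "score": 0.05},
]
top = apply_rerank(docs, scores, top_k=2)
```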

LLM Client (llm.py)

Ollama LLM integration for answer generation:

  • Model: gemma3:latest (configurable)
  • Async HTTP to Ollama /api/chat
  • Configurable temperature (default: 0.7)
  • Configurable timeout (default: 120s)
  • Health check via /api/tags

The LLM receives a formatted prompt containing:

  1. The user's question
  2. The top-k retrieved document chunks as context
  3. Instructions to answer based only on the provided context
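
A prompt with those three parts might be assembled like this; the exact template used by llm.py is an assumption:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    # Number each retrieved chunk so the answer can cite its sources
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "How do I create an invoice?",
    ["Invoices live under Billing > New Invoice."],
)
```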

BM25 Index (bm25.py)

In-memory sparse keyword search:

  • TF-IDF based scoring algorithm
  • Index built from all documents in the ChromaDB collection
  • Returns documents ranked by keyword relevance
  • Complements vector search — catches exact term matches that semantic search may miss
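
A minimal BM25 scorer, using the common k1/b defaults, can illustrate the keyword side. This sketch may differ from bm25.py in tokenization and parameter choices:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    q_terms = query.lower().split()
    # Document frequency per query term
    df = {t: sum(1 for doc in tokenized if t in doc) for t in q_terms}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for t in q_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            f = tf[t]
            # Term frequency saturation (k1) and length normalization (b)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = ["create an invoice", "delete a user", "invoice templates and billing"]
scores = bm25_scores("invoice", docs)
```

The shortest document containing the exact term scores highest, which is precisely the behavior that complements semantic search.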
