Mjara Docs

Storage

ChromaDB vector store and PostgreSQL database modules


The RAG system uses a dual-storage architecture: ChromaDB for vector embeddings and PostgreSQL for metadata.

Storage Architecture

VectorStore (vector_store.py)

ChromaDB wrapper supporting local and remote backends.

Configuration

| Mode   | Setting                              | Description                    |
|--------|--------------------------------------|--------------------------------|
| Local  | VECTOR_PERSIST_PATH=./data/chroma_db | Persistent local storage       |
| Remote | VECTOR_HOST + VECTOR_PORT            | HTTP client to ChromaDB server |

Operations

from src.vector_store import VectorStore

store = VectorStore(persist_path="./data/chroma_db", collection_name="rag_documents")

# Add documents with embeddings
store.add(ids=["doc1"], embeddings=[[0.1, 0.2, ...]], documents=["text"], metadatas=[{...}])

# Query by vector similarity
results = store.query(query_embeddings=[[0.1, 0.2, ...]], n_results=10)

# Get documents by ID
docs = store.get(ids=["doc1"])

# Delete documents
store.delete(ids=["doc1"])

# Collection stats
count = store.count()

Index Type

ChromaDB uses HNSW (Hierarchical Navigable Small World) with cosine similarity for fast approximate nearest-neighbor search.
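The metric itself is easy to illustrate. The sketch below computes cosine similarity with only the standard library; the function name is illustrative and not part of the VectorStore API. HNSW's job is merely to find approximate nearest neighbors under this metric without scanning every vector.

```python
import math

def cosine_similarity(a, b):
    # Illustrative only: ChromaDB computes this internally during query().
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
```

Note that ChromaDB query results report cosine distance (1 minus this similarity), so smaller values mean closer matches.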

DocumentManager (document_manager.py)

Unified CRUD interface that coordinates operations across both ChromaDB and PostgreSQL:

from src import create_document_manager, Document

manager = create_document_manager(
    persist_path="./data/chroma_db",
    embedding_model="bge-m3:latest",
)

# Add a document
doc = Document(id="doc1", text="content", metadata={"source": "web"})
manager.add(doc)

# Upsert a batch: only re-embeds documents whose content changed
result = manager.upsert_many(documents)  # documents: a list of Document objects
print(f"Added: {result.added}, Updated: {result.updated}, Skipped: {result.skipped}")

# Delete operations
manager.delete("doc1")
manager.delete_by_source("https://example.com")
manager.delete_by_filter({"language": "en"})

Upsert Logic

The upsert operation compares content hashes to minimize unnecessary embedding: a document whose ID is new is embedded and added, one whose stored hash matches the incoming content is skipped, and one whose hash differs is re-embedded and updated.
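That three-way decision can be sketched as follows. This is a minimal stand-in, not the real DocumentManager code: the function name and the `stored_hashes` mapping are illustrative, and MD5 is used because the schema below records MD5 content hashes.

```python
import hashlib

def upsert_decision(doc_id, content, stored_hashes):
    # stored_hashes: dict mapping document ID -> MD5 hex digest of stored content
    new_hash = hashlib.md5(content.encode("utf-8")).hexdigest()
    if doc_id not in stored_hashes:
        return "added"    # new ID: embed and insert
    if stored_hashes[doc_id] == new_hash:
        return "skipped"  # unchanged content: no re-embedding
    return "updated"      # changed content: re-embed and replace
```

The three return values mirror the `result.added` / `result.updated` / `result.skipped` counters reported by upsert_many.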

PostgreSQL Database (database/)

Models (models.py)

Three SQLAlchemy ORM models:

Source — tracks document origins:

| Column          | Type      | Description              |
|-----------------|-----------|--------------------------|
| id              | UUID      | Primary key              |
| source_type     | string    | "url" or "file"          |
| url             | text      | Source URL (if scraped)  |
| url_hash        | string    | MD5 hash of URL          |
| filepath        | text      | File path (if uploaded)  |
| filepath_hash   | string    | MD5 hash of filepath     |
| filename        | string    | Original filename        |
| created_at      | timestamp | First ingestion time     |
| last_scraped_at | timestamp | Last re-scrape time      |
| scrape_count    | int       | Number of times ingested |
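The hash columns give fixed-length keys for fast lookups against long URLs and paths. Computing one is a one-liner; the helper name below is illustrative, but MD5 matches the column descriptions above.

```python
import hashlib

def md5_hex(value: str) -> str:
    # MD5 of the UTF-8 bytes, stored as the 32-character hex digest used by
    # url_hash, filepath_hash, and content_hash.
    return hashlib.md5(value.encode("utf-8")).hexdigest()

print(md5_hex("https://example.com"))  # 32 hex characters
```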

Document — tracks individual chunks:

| Column       | Type      | Description                     |
|--------------|-----------|---------------------------------|
| id           | string    | Primary key (matches ChromaDB ID) |
| source_id    | UUID      | Foreign key to sources          |
| title        | string    | Document title                  |
| section      | string    | Category/section                |
| content      | text      | Full text content               |
| content_hash | string    | MD5 hash of content             |
| doc_metadata | JSONB     | Additional metadata             |
| scraped_at   | timestamp | Ingestion timestamp             |
| is_active    | boolean   | Soft delete flag                |

DuplicateLog — records duplicate detection events:

| Column             | Type      | Description                        |
|--------------------|-----------|------------------------------------|
| id                 | int       | Primary key                        |
| duplicate_type     | string    | "url", "content", "near_duplicate" |
| action_taken       | string    | "skipped", "updated"               |
| original_doc_id    | string    | ID of the existing document        |
| attempted_url      | text      | URL that was a duplicate           |
| attempted_filepath | text      | File that was a duplicate          |
| created_at         | timestamp | Detection timestamp                |

Repository (repository.py)

CRUD operations with transaction support:

  • add_source() / get_source_by_url() / get_source_by_filepath()
  • add_document() / get_document() / delete_document()
  • log_duplicate() / get_duplicates()
  • get_stats() — aggregated statistics
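As a rough sketch of that surface, the in-memory stand-in below mirrors the listed method names; everything else (storage as dicts, the stats keys, the argument shapes) is assumed for illustration and is not the real SQLAlchemy-backed implementation.

```python
class InMemoryRepository:
    """Illustrative stand-in for repository.py's CRUD surface."""

    def __init__(self):
        self._sources = {}      # url -> source record
        self._documents = {}    # doc id -> document record
        self._duplicates = []   # duplicate-log entries

    def add_source(self, url, source_type="url"):
        self._sources[url] = {"url": url, "source_type": source_type}
        return self._sources[url]

    def get_source_by_url(self, url):
        return self._sources.get(url)

    def add_document(self, doc_id, content, source_url=None):
        self._documents[doc_id] = {
            "id": doc_id, "content": content, "source_url": source_url,
        }

    def get_document(self, doc_id):
        return self._documents.get(doc_id)

    def delete_document(self, doc_id):
        self._documents.pop(doc_id, None)

    def log_duplicate(self, duplicate_type, action_taken, original_doc_id):
        self._duplicates.append({
            "duplicate_type": duplicate_type,
            "action_taken": action_taken,
            "original_doc_id": original_doc_id,
        })

    def get_stats(self):
        # Aggregated counts, as get_stats() is described to return.
        return {
            "sources": len(self._sources),
            "documents": len(self._documents),
            "duplicates": len(self._duplicates),
        }
```

The real repository wraps these operations in database transactions, so a failed write rolls back rather than leaving partial state.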

Session (session.py)

Connection management using SQLAlchemy:

  • Connection pooling
  • Session factory
  • Automatic table creation
  • Configurable via DB_URL environment variable
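A minimal sketch of the DB_URL-driven configuration: the fallback URL and pool settings below are illustrative assumptions, not the project's actual defaults.

```python
import os

def engine_settings():
    # DB_URL is the documented override; the fallback here is an assumption.
    return {
        "url": os.environ.get("DB_URL", "postgresql://localhost:5432/rag"),
        "pool_size": 5,         # connection-pool size (assumed default)
        "pool_pre_ping": True,  # test pooled connections before reuse
    }
```

The keys deliberately mirror SQLAlchemy's create_engine arguments (pool_size, pool_pre_ping), so the dict could be splatted into an engine factory.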
