
Ingestion

Web scraping, file upload, and document parsing modules

The ingestion layer acquires content from external sources and feeds it into the processing pipeline.

Ingestion Flow

WebScraper (scraper.py)

Scrapes web pages and processes content through the RAG pipeline.

from src.scraper import create_scraper

scraper = create_scraper(
    vector_store_path="./data/chroma_db",
    collection_name="rag_documents",
)

# Scrape and ingest a single URL
result = scraper.scrape_and_ingest("https://docs.example.com/page")
print(f"Title: {result.title}, Chunks added: {result.chunks_added}")

# Check for duplicates
if result.is_duplicate:
    print(f"Duplicate: {result.duplicate_message}")

# Scrape multiple URLs
results = scraper.scrape_and_ingest_many([
    "https://docs.example.com/page1",
    "https://docs.example.com/page2",
])

Scraping Process

  1. Fetch — HTTP GET via requests library
  2. Parse — Extract text from HTML using BeautifulSoup4
  3. Dedup — Check URL hash against existing sources in PostgreSQL
  4. Process — Run through RAGPreprocessor (clean, chunk, dedup, detect language)
  5. Embed — Generate vectors via Ollama
  6. Store — Save to ChromaDB + PostgreSQL
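
The fetch, parse, and dedup steps correspond roughly to the sketch below. This is a simplified illustration under stated assumptions, not the scraper's actual internals; source_exists stands in for the PostgreSQL lookup against previously ingested sources.

import hashlib

import requests
from bs4 import BeautifulSoup


def fetch_parse_dedup(url: str) -> tuple[str, str] | None:
    """Fetch a page, extract its visible text, and skip it if the URL was seen before."""
    # Fetch: plain HTTP GET via requests.
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    # Parse: strip markup with BeautifulSoup and keep the readable text.
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else url
    text = soup.get_text(separator="\n", strip=True)

    # Dedup: hash the URL and check it against sources already ingested.
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    if source_exists(url_hash):  # hypothetical PostgreSQL lookup
        return None

    return title, text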

FileUploader (uploader.py)

Processes uploaded files through the RAG pipeline.

from src.uploader import create_uploader

uploader = create_uploader(
    vector_store_path="./data/chroma_db",
    collection_name="rag_documents",
)

# Upload and ingest a single file
result = uploader.upload_and_ingest("/path/to/document.pdf", section="manuals")

# Upload multiple files
results = uploader.upload_and_ingest_many([
    "/path/to/doc1.md",
    "/path/to/doc2.pdf",
], section="documentation")

Upload Process

  1. Save — Copy file to data/uploads/ directory
  2. Parse — Extract content using DocumentParser (Docling)
  3. Dedup — Check content hash against existing documents
  4. Process — Run through RAGPreprocessor
  5. Embed — Generate vectors via Ollama
  6. Store — Save to ChromaDB + PostgreSQL
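
Content-level deduplication amounts to hashing the parsed text and comparing it against what is already stored. The snippet below is a minimal sketch of that idea; document_exists is a hypothetical stand-in for the PostgreSQL lookup used by the pipeline.

import hashlib


def is_duplicate_content(parsed_text: str) -> bool:
    """Hash the parsed document body and check whether it was ingested before."""
    content_hash = hashlib.sha256(parsed_text.encode("utf-8")).hexdigest()
    return document_exists(content_hash)  # hypothetical PostgreSQL lookup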

DocumentParser (document_parser.py)

Multi-format document parser built on Docling.

from src.document_parser import DocumentParser

parser = DocumentParser(
    enable_ocr=True,
    enable_table_extraction=True,
    table_mode="accurate",
    extract_images=True,
    clean_html=True,
)

# Parse a document
doc = parser.parse("document.pdf")

# Access content
print(doc.text)        # Plain text
print(doc.markdown)    # Markdown format
print(doc.tables)      # List of ParsedTable objects
print(doc.images)      # List of ParsedImage objects

# Full text with tables and images
full = doc.get_full_text(include_tables=True, include_images=True)

Supported Formats

Format       Extension                        Features
PDF          .pdf                             Text, tables, images, OCR
Word         .docx                            Text, tables
PowerPoint   .pptx                            Text, slides
HTML         .html, .htm                      Text, cleaned content
Markdown     .md                              Text
CSV          .csv                             Tabular data
Excel        .xlsx                            Tabular data
Images       .png, .jpg, .jpeg, .tiff, .bmp   OCR text extraction
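
A typical pre-check before handing a file to the parser is to branch on its extension. The snippet below only illustrates the table above; it is not the parser's actual dispatch logic.

from pathlib import Path

# Extensions from the supported-format table above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".html", ".htm", ".md",
    ".csv", ".xlsx", ".png", ".jpg", ".jpeg", ".tiff", ".bmp",
}


def is_supported(path: str) -> bool:
    """Return True if the file's extension appears in the supported-format table."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS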

Features

  • OCR — Tesseract + RapidOCR for scanned documents and images
  • Table Extraction — structure-preserving table extraction (accurate or fast mode)
  • Image Description — VLM-based description of charts and diagrams via Ollama
  • HTML Cleaning — removes navigation, scripts, and UI noise from HTML pages
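
HTML cleaning of this kind usually means dropping non-content tags before extracting text. The snippet below is a minimal sketch using BeautifulSoup, not the project's actual cleaning code; the exact tag list is an assumption.

from bs4 import BeautifulSoup


def strip_ui_noise(html: str) -> str:
    """Drop navigation, scripts, styles, and similar non-content tags, keeping readable text."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove tags that carry layout or behaviour rather than content.
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)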

CLI Scripts

Scrape URLs

# Single URL
python scripts/scrape_url.py "https://docs.example.com/page"

# Multiple URLs
python scripts/scrape_url.py "url1" "url2" "url3"

# From file (one URL per line)
python scripts/scrape_url.py --file urls.txt

Upload Files

# Single file
python scripts/upload_file.py /path/to/document.pdf

# Multiple files
python scripts/upload_file.py file1.md file2.txt file3.pdf

# Folder (recursive)
python scripts/upload_file.py --folder ./documents --recursive

# With section
python scripts/upload_file.py document.pdf --section "accounting"

Delete Documents

python scripts/delete_docs.py

Interactive menu for deleting by source, section, title, or search.
