# Ingestion

Web scraping, file upload, and document parsing modules.
The ingestion layer acquires content from external sources and feeds it into the processing pipeline.
## WebScraper (`scraper.py`)

Scrapes web pages and processes the content through the RAG pipeline.
```python
from src.scraper import create_scraper

scraper = create_scraper(
    vector_store_path="./data/chroma_db",
    collection_name="rag_documents",
)

# Scrape and ingest a single URL
result = scraper.scrape_and_ingest("https://docs.example.com/page")
print(f"Title: {result.title}, Chunks added: {result.chunks_added}")

# Check for duplicates
if result.is_duplicate:
    print(f"Duplicate: {result.duplicate_message}")

# Scrape multiple URLs
results = scraper.scrape_and_ingest_many([
    "https://docs.example.com/page1",
    "https://docs.example.com/page2",
])
```

### Scraping Process
- **Fetch** — HTTP GET via the `requests` library
- **Parse** — Extract text from HTML using `BeautifulSoup4`
- **Dedup** — Check URL hash against existing sources in PostgreSQL
- **Process** — Run through RAGPreprocessor (clean, chunk, dedup, detect language)
- **Embed** — Generate vectors via Ollama
- **Store** — Save to ChromaDB + PostgreSQL
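The dedup step compares a URL hash against sources already recorded in PostgreSQL. A minimal sketch of that idea, assuming a SHA-256 hash over a trivially normalized URL (the pipeline's actual normalization rules and database schema are not documented here, and the in-memory set stands in for the PostgreSQL lookup):

```python
import hashlib


def url_hash(url: str) -> str:
    # Strip a trailing slash as a stand-in for real URL normalization
    # (an assumption; the pipeline's normalization is not specified).
    normalized = url.rstrip("/")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# In the real pipeline these hashes live in PostgreSQL; a set stands in here.
seen_hashes = {url_hash("https://docs.example.com/page")}


def is_duplicate(url: str) -> bool:
    return url_hash(url) in seen_hashes
```

Hashing before fetching lets the scraper skip already-ingested URLs without re-downloading them.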
## FileUploader (`uploader.py`)

Processes uploaded files through the RAG pipeline.
```python
from src.uploader import create_uploader

uploader = create_uploader(
    vector_store_path="./data/chroma_db",
    collection_name="rag_documents",
)

# Upload and ingest a single file
result = uploader.upload_and_ingest("/path/to/document.pdf", section="manuals")

# Upload multiple files
results = uploader.upload_and_ingest_many([
    "/path/to/doc1.md",
    "/path/to/doc2.pdf",
], section="documentation")
```

### Upload Process
- **Save** — Copy file to the `data/uploads/` directory
- **Parse** — Extract content using DocumentParser (Docling)
- **Dedup** — Check content hash against existing documents
- **Process** — Run through RAGPreprocessor
- **Embed** — Generate vectors via Ollama
- **Store** — Save to ChromaDB + PostgreSQL
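For files, the dedup step hashes the document content rather than a URL. A rough sketch, assuming a streaming SHA-256 over the raw bytes (the hash algorithm and chunk size are assumptions; the document only states that a content hash is compared):

```python
import hashlib


def content_hash(path: str) -> str:
    """Streaming SHA-256 so large uploads are not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks; 8 KiB is an arbitrary illustrative choice.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```

Two uploads with identical bytes map to the same hash, so the second can be rejected before any parsing or embedding work is spent on it.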
## DocumentParser (`document_parser.py`)

Multi-format document parser built on Docling.
```python
from src.document_parser import DocumentParser

parser = DocumentParser(
    enable_ocr=True,
    enable_table_extraction=True,
    table_mode="accurate",
    extract_images=True,
    clean_html=True,
)

# Parse a document
doc = parser.parse("document.pdf")

# Access content
print(doc.text)      # Plain text
print(doc.markdown)  # Markdown format
print(doc.tables)    # List of ParsedTable objects
print(doc.images)    # List of ParsedImage objects

# Full text with tables and images
full = doc.get_full_text(include_tables=True, include_images=True)
```

### Supported Formats
| Format | Extension | Features |
|---|---|---|
| PDF | .pdf | Text, tables, images, OCR |
| Word | .docx | Text, tables |
| PowerPoint | .pptx | Text, slides |
| HTML | .html, .htm | Text, cleaned content |
| Markdown | .md | Text |
| CSV | .csv | Tabular data |
| Excel | .xlsx | Tabular data |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | OCR text extraction |
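A dispatch on file extension, driven by the table above, might look like this (the extension-to-format mapping comes from the table; the function name and error behavior are illustrative, not part of the DocumentParser API):

```python
from pathlib import Path

# Extension → format family, taken from the supported-formats table.
SUPPORTED = {
    ".pdf": "pdf",
    ".docx": "word",
    ".pptx": "powerpoint",
    ".html": "html", ".htm": "html",
    ".md": "markdown",
    ".csv": "csv",
    ".xlsx": "excel",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".tiff": "image", ".bmp": "image",
}


def detect_format(path: str) -> str:
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED:
        raise ValueError(f"unsupported format: {ext}")
    return SUPPORTED[ext]
```

Lower-casing the suffix means `report.PDF` and `report.pdf` resolve to the same parser path.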
### Features

- **OCR** — Tesseract + RapidOCR for scanned documents and images
- **Table Extraction** — structure-preserving table extraction (accurate or fast mode)
- **Image Description** — VLM-based description of charts and diagrams via Ollama
- **HTML Cleaning** — removes navigation, scripts, and UI noise from HTML pages
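Docling performs the actual HTML cleaning; as a rough stdlib-only illustration of what stripping navigation, script, and style noise involves (the tag list here is an assumption, not Docling's):

```python
from html.parser import HTMLParser

# Tags whose contents are treated as UI noise (illustrative list).
NOISE_TAGS = {"script", "style", "nav", "header", "footer"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # >0 while inside a noise element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth:
            self.noise_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside every noise element.
        if self.noise_depth == 0 and data.strip():
            self.parts.append(data.strip())


def clean_html(html: str) -> str:
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.parts)
```

Navigation menus and inline scripts vanish while body text survives, which keeps boilerplate out of the chunks that get embedded.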
## CLI Scripts

### Scrape URLs

```shell
# Single URL
python scripts/scrape_url.py "https://docs.example.com/page"

# Multiple URLs
python scripts/scrape_url.py "url1" "url2" "url3"

# From file (one URL per line)
python scripts/scrape_url.py --file urls.txt
```

### Upload Files
```shell
# Single file
python scripts/upload_file.py /path/to/document.pdf

# Multiple files
python scripts/upload_file.py file1.md file2.txt file3.pdf

# Folder (recursive)
python scripts/upload_file.py --folder ./documents --recursive

# With section
python scripts/upload_file.py document.pdf --section "accounting"
```

### Delete Documents
```shell
python scripts/delete_docs.py
```

Interactive menu for deleting documents by source, section, title, or search.