# Ingestion

Web scraping, file upload, and document parsing modules.
The ingestion layer acquires content from external sources and feeds it into the processing pipeline.
## WebScraper (`scraper.py`)

Scrapes web pages and processes the content through the RAG pipeline.
```python
from src.scraper import create_scraper

scraper = create_scraper(
    vector_store_path="./data/chroma_db",
    collection_name="rag_documents",
)

# Scrape and ingest a single URL
result = scraper.scrape_and_ingest("https://docs.example.com/page")
print(f"Title: {result.title}, Chunks added: {result.chunks_added}")

# Check for duplicates
if result.is_duplicate:
    print(f"Duplicate: {result.duplicate_message}")

# Scrape multiple URLs
results = scraper.scrape_and_ingest_many([
    "https://docs.example.com/page1",
    "https://docs.example.com/page2",
])
```

### Scraping Process
- **Fetch** — HTTP GET via the `requests` library
- **Parse** — Extract text from HTML using `BeautifulSoup4`
- **Dedup** — Check URL hash against existing sources in PostgreSQL
- **Process** — Run through RAGPreprocessor (clean, chunk, dedup, detect language)
- **Embed** — Generate vectors via Ollama
- **Store** — Save to ChromaDB + PostgreSQL
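The dedup step compares a URL hash against sources already recorded in PostgreSQL. A minimal sketch of that idea, assuming a SHA-256 hash over a trivially normalized URL (the pipeline's actual normalization rules and database schema are not documented here, and the in-memory set stands in for the PostgreSQL lookup):

```python
import hashlib


def url_hash(url: str) -> str:
    # Strip a trailing slash as a stand-in for real URL normalization
    # (an assumption; the pipeline's normalization is not specified).
    normalized = url.rstrip("/")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# In the real pipeline these hashes live in PostgreSQL; a set stands in here.
seen_hashes = {url_hash("https://docs.example.com/page")}


def is_duplicate(url: str) -> bool:
    return url_hash(url) in seen_hashes
```

Hashing before fetching lets the scraper skip already-ingested URLs without re-downloading them.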
## FileUploader (`uploader.py`)

Processes uploaded files through the RAG pipeline.
```python
from src.uploader import create_uploader

uploader = create_uploader(
    vector_store_path="./data/chroma_db",
    collection_name="rag_documents",
)

# Upload and ingest a single file
result = uploader.upload_and_ingest("/path/to/document.pdf", section="manuals")

# Upload multiple files
results = uploader.upload_and_ingest_many([
    "/path/to/doc1.md",
    "/path/to/doc2.pdf",
], section="documentation")
```

### Upload Process
- **Save** — Copy file to the `data/uploads/` directory
- **Parse** — Extract content using DocumentParser (Docling)
- **Dedup** — Check content hash against existing documents
- **Process** — Run through RAGPreprocessor
- **Embed** — Generate vectors via Ollama
- **Store** — Save to ChromaDB + PostgreSQL
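For files, the dedup step hashes the document content rather than a URL. A rough sketch, assuming a streaming SHA-256 over the raw bytes (the hash algorithm and chunk size are assumptions; the document only states that a content hash is compared):

```python
import hashlib


def content_hash(path: str) -> str:
    """Streaming SHA-256 so large uploads are not read into memory at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks; 8 KiB is an arbitrary illustrative choice.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()
```

Two uploads with identical bytes map to the same hash, so the second can be rejected before any parsing or embedding work is spent on it.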
## DocumentParser (`document_parser.py`)

Multi-format document parser built on Docling.
```python
from src.document_parser import DocumentParser

parser = DocumentParser(
    enable_ocr=True,
    enable_table_extraction=True,
    table_mode="accurate",
    extract_images=True,
    clean_html=True,
)

# Parse a document
doc = parser.parse("document.pdf")

# Access content
print(doc.text)      # Plain text
print(doc.markdown)  # Markdown format
print(doc.tables)    # List of ParsedTable objects
print(doc.images)    # List of ParsedImage objects

# Full text with tables and images
full = doc.get_full_text(include_tables=True, include_images=True)
```

### Supported Formats
| Format | Extension | Features |
|---|---|---|
| PDF | .pdf | Text, tables, images, OCR |
| Word | .docx | Text, tables |
| PowerPoint | .pptx | Text, slides |
| HTML | .html, .htm | Text, cleaned content |
| Markdown | .md | Text |
| CSV | .csv | Tabular data |
| Excel | .xlsx | Tabular data |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | OCR text extraction |
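A dispatch on file extension, driven by the table above, might look like this (the extension-to-format mapping comes from the table; the function name and error behavior are illustrative, not part of the DocumentParser API):

```python
from pathlib import Path

# Extension → format family, taken from the supported-formats table.
SUPPORTED = {
    ".pdf": "pdf",
    ".docx": "word",
    ".pptx": "powerpoint",
    ".html": "html", ".htm": "html",
    ".md": "markdown",
    ".csv": "csv",
    ".xlsx": "excel",
    ".png": "image", ".jpg": "image", ".jpeg": "image",
    ".tiff": "image", ".bmp": "image",
}


def detect_format(path: str) -> str:
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED:
        raise ValueError(f"unsupported format: {ext}")
    return SUPPORTED[ext]
```

Lower-casing the suffix means `report.PDF` and `report.pdf` resolve to the same parser path.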
### Features

- **OCR** — Tesseract + RapidOCR for scanned documents and images
- **Table Extraction** — structure-preserving table extraction (accurate or fast mode)
- **Image Description** — VLM-based description of charts and diagrams via Ollama
- **HTML Cleaning** — removes navigation, scripts, and UI noise from HTML pages
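Docling performs the actual HTML cleaning; as a rough stdlib-only illustration of what stripping navigation, script, and style noise involves (the tag list here is an assumption, not Docling's):

```python
from html.parser import HTMLParser

# Tags whose contents are treated as UI noise (illustrative list).
NOISE_TAGS = {"script", "style", "nav", "header", "footer"}


class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # >0 while inside a noise element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth:
            self.noise_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside every noise element.
        if self.noise_depth == 0 and data.strip():
            self.parts.append(data.strip())


def clean_html(html: str) -> str:
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.parts)
```

Navigation menus and inline scripts vanish while body text survives, which keeps boilerplate out of the chunks that get embedded.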
## CLI Scripts

### Scrape URLs

```shell
# Single URL
python scripts/scrape_url.py "https://docs.example.com/page"

# Multiple URLs
python scripts/scrape_url.py "url1" "url2" "url3"

# From file (one URL per line)
python scripts/scrape_url.py --file urls.txt
```

### Upload Files
```shell
# Single file
python scripts/upload_file.py /path/to/document.pdf

# Multiple files
python scripts/upload_file.py file1.md file2.txt file3.pdf

# Folder (recursive)
python scripts/upload_file.py --folder ./documents --recursive

# With section
python scripts/upload_file.py document.pdf --section "accounting"
```

### Delete Documents
```shell
python scripts/delete_docs.py
```

Interactive menu for deleting documents by source, section, title, or search.