Parse Endpoints
Document parsing and semantic chunking endpoints
Base path: /api/v1/parse
Parse documents without storing them in the RAG system. Useful for previewing content before ingestion, extracting clean text for external processing, or testing parsing.
POST /api/v1/parse
Parse a document and return cleaned, extracted content without storing it.
Content-Type: multipart/form-data
Form Fields:
| Field | Type | Default | Description |
|---|---|---|---|
| file | file | required | Document to parse |
| include_tables | bool | true | Include tables in full_text |
| include_images | bool | true | Include image descriptions in full_text |
| enable_ocr | bool | false | Enable OCR for scanned documents |
| enable_image_description | bool | true | Use VLM to describe images/charts |
| table_mode | string | "fast" | Table extraction mode: "accurate" or "fast" |
| remove_urls | bool | true | Remove URLs from text |
| normalize_arabic | bool | true | Normalize Arabic character variations |
| summarize_llm_ready | bool | true | Use LLM to summarize the llm_ready field (auto-disabled for large files) |
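The same form fields can be assembled client-side. A minimal sketch for Node 18+ or the browser, using the X-API-Key header and localhost URL from the curl example in this section (adjust values to taste):

```javascript
// Minimal sketch (Node 18+ / browser): build the multipart body for
// POST /api/v1/parse. All non-file form values are sent as strings.
function buildParseForm(file) {
  const form = new FormData();
  form.append('file', file);
  form.append('enable_ocr', 'false');
  form.append('table_mode', 'fast');
  form.append('summarize_llm_ready', 'true');
  return form;
}

// Then send it, e.g.:
// await fetch('http://localhost:9000/api/v1/parse', {
//   method: 'POST',
//   headers: { 'X-API-Key': 'your-key' },
//   body: buildParseForm(file),
// });
```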
Response:
{
"success": true,
"source_file": "document.pdf",
"format": ".pdf",
"metadata": {
"num_pages": 10,
"title": "Document Title",
"author": "Author Name"
},
"content": {
"cleaned_text": "Extracted and cleaned text content...",
"char_count": 5000
},
"tables": [
{
"index": 0,
"page_number": 3,
"rows": 5,
"cols": 3,
"content": "Plain text table content",
"markdown": "| Col 1 | Col 2 | Col 3 |\n|---|---|---|..."
}
],
"tables_count": 1,
"images": [
{
"index": 0,
"page_number": 2,
"description": "A bar chart showing quarterly revenue..."
}
],
"images_count": 1,
"full_text": "Combined text with tables and images...",
"llm_ready": "Summarized text optimized for LLM consumption..."
}

LLM summarization is auto-disabled for large files (>5 MB or >50k characters of content).
Example:
curl -X POST http://localhost:9000/api/v1/parse \
-H "X-API-Key: your-key" \
-F "file=@document.pdf" \
-F "enable_ocr=false" \
  -F "table_mode=fast"

GET /api/v1/parse/formats
Get the list of supported document formats.
Response:
{
"formats": [".pdf", ".docx", ".pptx", ".html", ".htm", ".md", ".csv", ".xlsx", ".png", ".jpg", ".jpeg", ".tiff", ".tif", ".bmp"],
"descriptions": {
".pdf": "PDF documents (with OCR support)",
".docx": "Microsoft Word documents",
".pptx": "Microsoft PowerPoint presentations",
".html": "HTML web pages",
".md": "Markdown files",
".csv": "CSV spreadsheets",
".xlsx": "Microsoft Excel spreadsheets",
".png": "PNG images (OCR)",
".jpg": "JPEG images (OCR)"
}
}

POST /api/v1/parse/sections
Parse a document with semantic chunking. Returns a flat array of chunks with rich metadata.
Form Fields:
| Field | Type | Default | Description |
|---|---|---|---|
| file | file | required | Document to parse |
| enable_ocr | bool | false | Enable OCR for scanned documents |
| enable_picture_description | bool | true | Use VLM to describe images |
| table_mode | string | "fast" | Table extraction mode: "accurate" or "fast" |
| remove_urls | bool | true | Remove URLs from text |
| normalize_arabic | bool | true | Normalize Arabic characters |
| chunking_mode | string | "auto" | "auto", "semantic", "page", "paragraph", or "disabled" |
| target_chunk_words | int | 500 | Target words per chunk |
| max_chunk_words | int | 1500 | Maximum words per chunk |
| include_full_text | bool | false | Include full document text in response |
Chunking Modes
| Mode | Description |
|---|---|
| auto | Analyze document structure and choose the best strategy |
| semantic | Force heading-based chunking |
| page | Force page-based chunking |
| paragraph | Force paragraph-based chunking |
| disabled | Return the full document as a single chunk |
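To make the word budgets concrete, here is an illustrative client-side sketch of what paragraph-based chunking with target_chunk_words and max_chunk_words conceptually does. This is not the server's implementation, which does its own structure analysis; it only shows how the two limits interact:

```javascript
// Illustrative paragraph chunker: close a chunk once it reaches the
// target word budget, and flush immediately at the hard maximum.
function chunkParagraphs(text, targetWords = 500, maxWords = 1500) {
  const chunks = [];
  let current = [];
  let count = 0;
  for (const para of text.split(/\n\s*\n/)) {
    const words = para.trim().split(/\s+/).filter(Boolean);
    if (words.length === 0) continue;
    // Close the current chunk once adding this paragraph would
    // exceed the target budget.
    if (count > 0 && count + words.length > targetWords) {
      chunks.push(current.join('\n\n'));
      current = [];
      count = 0;
    }
    current.push(para.trim());
    count += words.length;
    // Hard cap: flush oversized chunks immediately.
    if (count >= maxWords) {
      chunks.push(current.join('\n\n'));
      current = [];
      count = 0;
    }
  }
  if (current.length) chunks.push(current.join('\n\n'));
  return chunks;
}
```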
Response:
{
"success": true,
"source_file": "document.pdf",
"format": ".pdf",
"metadata": {
"num_pages": 10,
"title": "Document Title",
"author": null
},
"chunks": [
{
"chunk_id": "chunk_1_0",
"content": "Section content here...",
"section_path": "Chapter 1 > Introduction",
"heading": "Introduction",
"heading_level": 2,
"page_start": 1,
"page_end": 2,
"tables": [],
"images": [],
"chunk_index": 0,
"total_chunks": 15,
"word_count": 450,
"parent_section_id": null,
"language": "en",
"is_continuation": false
}
],
"chunking_stats": {
"strategy_used": "semantic",
"total_chunks": 15,
"sections_found": 12,
"heading_distribution": {"1": 1, "2": 5, "3": 6},
"avg_chunk_words": 420.5,
"pages_processed": 10,
"total_words": 6307
},
"full_text": null
}

POST /api/v1/parse/sections/stream
Parse document and stream sections via Server-Sent Events (SSE). The entire document is parsed first, then sections are streamed as they are chunked.
SSE Event Types
| Event | Description |
|---|---|
| status | Parsing status updates (parsing, processing) |
| metadata | Document metadata (sent after parsing completes) |
| section | Complete section with headers, content, and tables |
| stats | Final chunking statistics |
| error | Error message if parsing fails |
| done | End-of-stream marker |
Form Fields: Same as /sections (except chunking_mode, target_chunk_words, max_chunk_words, include_full_text).
Example (JavaScript):
const formData = new FormData();
formData.append('file', file);
const response = await fetch('/api/v1/parse/sections/stream', {
method: 'POST',
headers: { 'X-API-Key': 'your-key' },
body: formData,
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // SSE lines can be split across network reads, so buffer partial lines.
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop();
  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      console.log('Section:', data);
    }
  }
}

POST /api/v1/parse/sections/incremental
Parse document with true incremental streaming. Unlike /sections/stream, this parses page-by-page and streams each chunk as soon as it's ready.
Form Fields:
| Field | Type | Default | Description |
|---|---|---|---|
| file | file | required | Document to parse |
| remove_urls | bool | true | Remove URLs from text |
| normalize_arabic | bool | true | Normalize Arabic characters |
| target_chunk_words | int | 500 | Target words per chunk |
| max_chunk_words | int | 1500 | Maximum words per chunk |
Supported formats: .pdf, .md, .markdown, .txt, .html, .htm, .docx
SSE Event Types
| Event | Description |
|---|---|
| metadata | Document metadata (sent as soon as it is known) |
| progress | Parsing progress (page X of Y) |
| chunk | Individual chunk, sent as soon as it is ready |
| stats | Final statistics |
| done | End of stream |
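These frames can be parsed by splitting on blank lines and pairing each event: line with its data: payload. A minimal sketch (the exact JSON fields inside each event are an assumption; inspect a live stream for the real shapes):

```javascript
// Minimal SSE frame parser: takes the raw stream text and returns
// [{ event, data }] pairs. A production client should buffer partial
// frames across network reads instead of parsing one big string.
function parseSseEvents(text) {
  const events = [];
  for (const frame of text.split('\n\n')) {
    let event = 'message'; // SSE default when no event: line is present
    let data = '';
    for (const line of frame.split('\n')) {
      if (line.startsWith('event: ')) event = line.slice(7).trim();
      else if (line.startsWith('data: ')) data += line.slice(6);
    }
    if (data) events.push({ event, data: JSON.parse(data) });
  }
  return events;
}
```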
Trade-offs vs /sections/stream
| | /sections/stream | /sections/incremental |
|---|---|---|
| First chunk latency | After full document parse | After first page parse |
| Progress updates | No per-page progress | Per-page progress |
| Table/image extraction | Full Docling accuracy | Heuristic-based |
| Heading detection | Accurate (Markdown) | Heuristic for PDFs |