Parse Endpoints
Document parsing and semantic chunking endpoints
Base path: /api/v1/parse
Parse documents without storing them in the RAG system. Useful for previewing content before ingestion, extracting clean text for external processing, or testing parsing.
POST /api/v1/parse
Parse a document and return cleaned, extracted content without storing it.
Content-Type: multipart/form-data
Form Fields:
| Field | Type | Default | Description |
|---|---|---|---|
| file | file | required | Document to parse |
| include_tables | bool | true | Include tables in full_text |
| include_images | bool | true | Include image descriptions in full_text |
| enable_ocr | bool | false | Enable OCR for scanned documents |
| enable_image_description | bool | true | Use VLM to describe images/charts |
| table_mode | string | "fast" | Table extraction mode: "accurate" or "fast" |
| remove_urls | bool | true | Remove URLs from text |
| normalize_arabic | bool | true | Normalize Arabic character variations |
| summarize_llm_ready | bool | true | Use LLM to summarize the llm_ready field (auto-disabled for large files) |
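The same form fields can be assembled client-side. A minimal sketch for Node 18+ or the browser, using the X-API-Key header and localhost URL from the curl example in this section (adjust values to taste):

```javascript
// Minimal sketch (Node 18+ / browser): build the multipart body for
// POST /api/v1/parse. All non-file form values are sent as strings.
function buildParseForm(file) {
  const form = new FormData();
  form.append('file', file);
  form.append('enable_ocr', 'false');
  form.append('table_mode', 'fast');
  form.append('summarize_llm_ready', 'true');
  return form;
}

// Then send it, e.g.:
// await fetch('http://localhost:9000/api/v1/parse', {
//   method: 'POST',
//   headers: { 'X-API-Key': 'your-key' },
//   body: buildParseForm(file),
// });
```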
Response:
{
"success": true,
"source_file": "document.pdf",
"format": ".pdf",
"metadata": {
"num_pages": 10,
"title": "Document Title",
"author": "Author Name"
},
"content": {
"cleaned_text": "Extracted and cleaned text content...",
"char_count": 5000
},
"tables": [
{
"index": 0,
"page_number": 3,
"rows": 5,
"cols": 3,
"content": "Plain text table content",
"markdown": "| Col 1 | Col 2 | Col 3 |\n|---|---|---|..."
}
],
"tables_count": 1,
"images": [
{
"index": 0,
"page_number": 2,
"description": "A bar chart showing quarterly revenue..."
}
],
"images_count": 1,
"full_text": "Combined text with tables and images...",
"llm_ready": "Summarized text optimized for LLM consumption..."
}

LLM summarization is auto-disabled for large files (>5 MB or >50k characters of content).
Example:
curl -X POST http://localhost:9000/api/v1/parse \
-H "X-API-Key: your-key" \
-F "file=@document.pdf" \
-F "enable_ocr=false" \
  -F "table_mode=fast"

GET /api/v1/parse/formats
Get the list of supported document formats.
Response:
{
"formats": [".pdf", ".docx", ".pptx", ".html", ".htm", ".md", ".csv", ".xlsx", ".png", ".jpg", ".jpeg", ".tiff", ".tif", ".bmp"],
"descriptions": {
".pdf": "PDF documents (with OCR support)",
".docx": "Microsoft Word documents",
".pptx": "Microsoft PowerPoint presentations",
".html": "HTML web pages",
".md": "Markdown files",
".csv": "CSV spreadsheets",
".xlsx": "Microsoft Excel spreadsheets",
".png": "PNG images (OCR)",
".jpg": "JPEG images (OCR)"
}
}

POST /api/v1/parse/sections
Parse a document with semantic chunking. Returns a flat array of chunks with rich metadata.
Form Fields:
| Field | Type | Default | Description |
|---|---|---|---|
| file | file | required | Document to parse |
| enable_ocr | bool | false | Enable OCR for scanned documents |
| enable_picture_description | bool | true | Use VLM to describe images |
| table_mode | string | "fast" | Table extraction mode: "accurate" or "fast" |
| remove_urls | bool | true | Remove URLs from text |
| normalize_arabic | bool | true | Normalize Arabic characters |
| chunking_mode | string | "auto" | "auto", "semantic", "page", "paragraph", or "disabled" |
| target_chunk_words | int | 500 | Target words per chunk |
| max_chunk_words | int | 1500 | Maximum words per chunk |
| include_full_text | bool | false | Include full document text in response |
Chunking Modes
| Mode | Description |
|---|---|
| auto | Analyze document structure and choose the best strategy |
| semantic | Force heading-based chunking |
| page | Force page-based chunking |
| paragraph | Force paragraph-based chunking |
| disabled | Return the full document as a single chunk |
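To make the word budgets concrete, here is an illustrative client-side sketch of what paragraph-based chunking with target_chunk_words and max_chunk_words conceptually does. This is not the server's implementation, which does its own structure analysis; it only shows how the two limits interact:

```javascript
// Illustrative paragraph chunker: close a chunk once it reaches the
// target word budget, and flush immediately at the hard maximum.
function chunkParagraphs(text, targetWords = 500, maxWords = 1500) {
  const chunks = [];
  let current = [];
  let count = 0;
  for (const para of text.split(/\n\s*\n/)) {
    const words = para.trim().split(/\s+/).filter(Boolean);
    if (words.length === 0) continue;
    // Close the current chunk once adding this paragraph would
    // exceed the target budget.
    if (count > 0 && count + words.length > targetWords) {
      chunks.push(current.join('\n\n'));
      current = [];
      count = 0;
    }
    current.push(para.trim());
    count += words.length;
    // Hard cap: flush oversized chunks immediately.
    if (count >= maxWords) {
      chunks.push(current.join('\n\n'));
      current = [];
      count = 0;
    }
  }
  if (current.length) chunks.push(current.join('\n\n'));
  return chunks;
}
```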
Response:
{
"success": true,
"source_file": "document.pdf",
"format": ".pdf",
"metadata": {
"num_pages": 10,
"title": "Document Title",
"author": null
},
"chunks": [
{
"chunk_id": "chunk_1_0",
"content": "Section content here...",
"section_path": "Chapter 1 > Introduction",
"heading": "Introduction",
"heading_level": 2,
"page_start": 1,
"page_end": 2,
"tables": [],
"images": [],
"chunk_index": 0,
"total_chunks": 15,
"word_count": 450,
"parent_section_id": null,
"language": "en",
"is_continuation": false
}
],
"chunking_stats": {
"strategy_used": "semantic",
"total_chunks": 15,
"sections_found": 12,
"heading_distribution": {"1": 1, "2": 5, "3": 6},
"avg_chunk_words": 420.5,
"pages_processed": 10,
"total_words": 6307
},
"full_text": null
}

POST /api/v1/parse/sections/stream
Parse document and stream sections via Server-Sent Events (SSE). The entire document is parsed first, then sections are streamed as they are chunked.
SSE Event Types
| Event | Description |
|---|---|
| status | Parsing status updates (parsing, processing) |
| metadata | Document metadata (sent after parsing completes) |
| section | Complete section with headers, content, and tables |
| stats | Final chunking statistics |
| error | Error message if parsing fails |
| done | End-of-stream marker |
Form Fields: Same as /sections (except chunking_mode, target_chunk_words, max_chunk_words, include_full_text).
Example (JavaScript):
const formData = new FormData();
formData.append('file', file);
const response = await fetch('/api/v1/parse/sections/stream', {
method: 'POST',
headers: { 'X-API-Key': 'your-key' },
body: formData,
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // SSE lines can be split across network reads, so buffer partial lines.
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop();
  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      console.log('Section:', data);
    }
  }
}

POST /api/v1/parse/sections/incremental
Parse document with true incremental streaming. Unlike /sections/stream, this parses page-by-page and streams each chunk as soon as it's ready.
Form Fields:
| Field | Type | Default | Description |
|---|---|---|---|
| file | file | required | Document to parse |
| remove_urls | bool | true | Remove URLs from text |
| normalize_arabic | bool | true | Normalize Arabic characters |
| target_chunk_words | int | 500 | Target words per chunk |
| max_chunk_words | int | 1500 | Maximum words per chunk |
Supported formats: .pdf, .md, .markdown, .txt, .html, .htm, .docx
SSE Event Types
| Event | Description |
|---|---|
| metadata | Document metadata (sent as soon as it is known) |
| progress | Parsing progress (page X of Y) |
| chunk | Individual chunk, sent as soon as it is ready |
| stats | Final statistics |
| done | End of stream |
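These frames can be parsed by splitting on blank lines and pairing each event: line with its data: payload. A minimal sketch (the exact JSON fields inside each event are an assumption; inspect a live stream for the real shapes):

```javascript
// Minimal SSE frame parser: takes the raw stream text and returns
// [{ event, data }] pairs. A production client should buffer partial
// frames across network reads instead of parsing one big string.
function parseSseEvents(text) {
  const events = [];
  for (const frame of text.split('\n\n')) {
    let event = 'message'; // SSE default when no event: line is present
    let data = '';
    for (const line of frame.split('\n')) {
      if (line.startsWith('event: ')) event = line.slice(7).trim();
      else if (line.startsWith('data: ')) data += line.slice(6);
    }
    if (data) events.push({ event, data: JSON.parse(data) });
  }
  return events;
}
```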
Trade-offs vs /sections/stream
| | /sections/stream | /sections/incremental |
|---|---|---|
| First chunk latency | After full document parse | After first page parse |
| Progress updates | No per-page progress | Per-page progress |
| Table/image extraction | Full Docling accuracy | Heuristic-based |
| Heading detection | Accurate (Markdown) | Heuristic for PDFs |