Parse Endpoints

Document parsing and semantic chunking endpoints

Base path: /api/v1/parse

Parse documents without storing them in the RAG system. Useful for previewing content before ingestion, extracting clean text for external processing, or testing parsing.

POST /api/v1/parse

Parse a document and return cleaned, extracted content without storing it.

Content-Type: multipart/form-data

Form Fields:

| Field | Type | Default | Description |
|---|---|---|---|
| file | file | required | Document to parse |
| include_tables | bool | true | Include tables in full_text |
| include_images | bool | true | Include image descriptions in full_text |
| enable_ocr | bool | false | Enable OCR for scanned documents |
| enable_image_description | bool | true | Use VLM to describe images/charts |
| table_mode | string | "fast" | Table extraction mode: "accurate" or "fast" |
| remove_urls | bool | true | Remove URLs from text |
| normalize_arabic | bool | true | Normalize Arabic character variations |
| summarize_llm_ready | bool | true | Use LLM to summarize the llm_ready field (auto-disabled for large files) |

Response:

{
  "success": true,
  "source_file": "document.pdf",
  "format": ".pdf",
  "metadata": {
    "num_pages": 10,
    "title": "Document Title",
    "author": "Author Name"
  },
  "content": {
    "cleaned_text": "Extracted and cleaned text content...",
    "char_count": 5000
  },
  "tables": [
    {
      "index": 0,
      "page_number": 3,
      "rows": 5,
      "cols": 3,
      "content": "Plain text table content",
      "markdown": "| Col 1 | Col 2 | Col 3 |\n|---|---|---|..."
    }
  ],
  "tables_count": 1,
  "images": [
    {
      "index": 0,
      "page_number": 2,
      "description": "A bar chart showing quarterly revenue..."
    }
  ],
  "images_count": 1,
  "full_text": "Combined text with tables and images...",
  "llm_ready": "Summarized text optimized for LLM consumption..."
}

LLM summarization is auto-disabled for large files (>5 MB or >50k characters of content).

Example:

curl -X POST http://localhost:9000/api/v1/parse \
  -H "X-API-Key: your-key" \
  -F "file=@document.pdf" \
  -F "enable_ocr=false" \
  -F "table_mode=fast"
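Because summarization is auto-disabled for large files, the llm_ready field may not be usable for every response. A small client-side fallback over the response shape above (a sketch; whether a skipped summary appears as null or is omitted entirely is an assumption, so both are handled):

```javascript
// Pick text for downstream LLM use from a /api/v1/parse response:
// prefer the summarized llm_ready field, fall back to cleaned_text.
// Assumes llm_ready is null or absent when summarization was skipped.
function textForLLM(parseResponse) {
  return parseResponse.llm_ready ?? parseResponse.content.cleaned_text;
}
```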

GET /api/v1/parse/formats

Get the list of supported document formats.

Response:

{
  "formats": [".pdf", ".docx", ".pptx", ".html", ".htm", ".md", ".csv", ".xlsx", ".png", ".jpg", ".jpeg", ".tiff", ".tif", ".bmp"],
  "descriptions": {
    ".pdf": "PDF documents (with OCR support)",
    ".docx": "Microsoft Word documents",
    ".pptx": "Microsoft PowerPoint presentations",
    ".html": "HTML web pages",
    ".md": "Markdown files",
    ".csv": "CSV spreadsheets",
    ".xlsx": "Microsoft Excel spreadsheets",
    ".png": "PNG images (OCR)",
    ".jpg": "JPEG images (OCR)"
  }
}
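A client can use this list to validate filenames before uploading. A minimal sketch (the helper name is illustrative, not part of the API):

```javascript
// Return true when a filename's extension appears in the formats array
// returned by GET /api/v1/parse/formats.
function isSupported(filename, formats) {
  const dot = filename.lastIndexOf('.');
  if (dot === -1) return false;
  return formats.includes(filename.slice(dot).toLowerCase());
}
```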

POST /api/v1/parse/sections

Parse a document with semantic chunking. Returns a flat array of chunks with rich metadata.

Form Fields:

| Field | Type | Default | Description |
|---|---|---|---|
| file | file | required | Document to parse |
| enable_ocr | bool | false | Enable OCR for scanned documents |
| enable_picture_description | bool | true | Use VLM to describe images |
| table_mode | string | "fast" | Table extraction mode |
| remove_urls | bool | true | Remove URLs from text |
| normalize_arabic | bool | true | Normalize Arabic characters |
| chunking_mode | string | "auto" | One of: auto, semantic, page, paragraph, disabled |
| target_chunk_words | int | 500 | Target words per chunk |
| max_chunk_words | int | 1500 | Maximum words per chunk |
| include_full_text | bool | false | Include full document text in response |

Chunking Modes

| Mode | Description |
|---|---|
| auto | Analyze document structure and choose the best strategy |
| semantic | Force heading-based chunking |
| page | Force page-based chunking |
| paragraph | Force paragraph-based chunking |
| disabled | Return full document as a single chunk |

Response:

{
  "success": true,
  "source_file": "document.pdf",
  "format": ".pdf",
  "metadata": {
    "num_pages": 10,
    "title": "Document Title",
    "author": null
  },
  "chunks": [
    {
      "chunk_id": "chunk_1_0",
      "content": "Section content here...",
      "section_path": "Chapter 1 > Introduction",
      "heading": "Introduction",
      "heading_level": 2,
      "page_start": 1,
      "page_end": 2,
      "tables": [],
      "images": [],
      "chunk_index": 0,
      "total_chunks": 15,
      "word_count": 450,
      "parent_section_id": null,
      "language": "en",
      "is_continuation": false
    }
  ],
  "chunking_stats": {
    "strategy_used": "semantic",
    "total_chunks": 15,
    "sections_found": 12,
    "heading_distribution": {"1": 1, "2": 5, "3": 6},
    "avg_chunk_words": 420.5,
    "pages_processed": 10,
    "total_words": 6307
  },
  "full_text": null
}
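Boolean and numeric options travel as strings in multipart/form-data. A small helper can assemble the request body (a sketch assuming Node 18+ or browser globals; buildSectionsForm is an illustrative name, not part of the API):

```javascript
// Build the multipart body for POST /api/v1/parse/sections.
// Non-file options (see the form-field table above) are stringified.
function buildSectionsForm(file, opts = {}) {
  const form = new FormData();
  form.append('file', file);
  for (const [key, value] of Object.entries(opts)) {
    form.append(key, String(value));
  }
  return form;
}

// Usage:
// const form = buildSectionsForm(file, { chunking_mode: 'semantic', target_chunk_words: 500 });
// const res = await fetch('/api/v1/parse/sections', {
//   method: 'POST', headers: { 'X-API-Key': 'your-key' }, body: form,
// });
```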

POST /api/v1/parse/sections/stream

Parse document and stream sections via Server-Sent Events (SSE). The entire document is parsed first, then sections are streamed as they are chunked.

SSE Event Types

| Event | Description |
|---|---|
| status | Parsing status updates (parsing, processing) |
| metadata | Document metadata (after parsing completes) |
| section | Complete section with headers, content, and tables |
| stats | Final chunking statistics |
| error | Error message if parsing fails |
| done | End of stream marker |

Form Fields: Same as /sections, excluding chunking_mode, target_chunk_words, max_chunk_words, and include_full_text.

Example (JavaScript):

const formData = new FormData();
formData.append('file', file);

const response = await fetch('/api/v1/parse/sections/stream', {
  method: 'POST',
  headers: { 'X-API-Key': 'your-key' },
  body: formData,
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { value, done } = await reader.read();
  if (done) break;

  // Decode with stream: true and keep any partial line in the buffer,
  // since an SSE line can be split across reads.
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop();
  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      console.log('Event data:', data);
    }
  }
}
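The loop above only reads data: lines; to distinguish the event types in the table (status, metadata, section, stats, error, done), the event: line must be parsed too. A standalone parser for a complete SSE payload (a sketch assuming standard event:/data: framing with JSON payloads; the server's exact framing is not shown on this page):

```javascript
// Parse a complete SSE payload into { event, data } records.
// Events are separated by blank lines; each may carry "event:" and "data:" fields.
function parseSSE(text) {
  const events = [];
  for (const block of text.split('\n\n')) {
    let event = 'message';
    const dataLines = [];
    for (const line of block.split('\n')) {
      if (line.startsWith('event: ')) event = line.slice(7).trim();
      else if (line.startsWith('data: ')) dataLines.push(line.slice(6));
    }
    if (dataLines.length) {
      events.push({ event, data: JSON.parse(dataLines.join('\n')) });
    }
  }
  return events;
}
```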

POST /api/v1/parse/sections/incremental

Parse document with true incremental streaming. Unlike /sections/stream, this parses page-by-page and streams each chunk as soon as it's ready.

Form Fields:

| Field | Type | Default | Description |
|---|---|---|---|
| file | file | required | Document to parse |
| remove_urls | bool | true | Remove URLs from text |
| normalize_arabic | bool | true | Normalize Arabic characters |
| target_chunk_words | int | 500 | Target words per chunk |
| max_chunk_words | int | 1500 | Maximum words per chunk |

Supported formats: .pdf, .md, .markdown, .txt, .html, .htm, .docx

SSE Event Types

| Event | Description |
|---|---|
| metadata | Document metadata (as soon as known) |
| progress | Parsing progress (page X of Y) |
| chunk | Individual chunk |
| stats | Final statistics |
| done | End of stream |

Trade-offs vs /sections/stream

| | /sections/stream | /sections/incremental |
|---|---|---|
| First chunk latency | After full document parse | After first page parse |
| Progress updates | No per-page progress | Per-page progress |
| Table/image extraction | Full Docling accuracy | Heuristic-based |
| Heading detection | Accurate (Markdown) | Heuristic for PDFs |
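Given these trade-offs and the narrower format support of /sections/incremental, a client might choose the endpoint per file. A sketch (the helper and its latency-preference default are illustrative, not part of the API):

```javascript
// Formats supported by /api/v1/parse/sections/incremental (per this page).
const INCREMENTAL_FORMATS = ['.pdf', '.md', '.markdown', '.txt', '.html', '.htm', '.docx'];

// Prefer incremental streaming for low first-chunk latency when the format
// allows it; otherwise fall back to the full-accuracy /sections/stream.
function pickStreamEndpoint(filename, preferLowLatency = true) {
  const dot = filename.lastIndexOf('.');
  const ext = dot === -1 ? '' : filename.slice(dot).toLowerCase();
  if (preferLowLatency && INCREMENTAL_FORMATS.includes(ext)) {
    return '/api/v1/parse/sections/incremental';
  }
  return '/api/v1/parse/sections/stream';
}
```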
