Mjara Docs
API Reference

Scrape Endpoints

URL web scraping endpoints


Base path: /api/v1/scrape

POST /api/v1/scrape

Scrape a single URL synchronously. Fetches the page content, processes it through the RAG pipeline, and stores the resulting chunks in the vector database.

Request Body:

{
  "url": "https://docs.example.com/page",
  "timeout": 30,
  "skip_duplicates": true
}
Field            Type          Default   Description
url              string (URL)  required  URL to scrape
timeout          int           30        Request timeout in seconds (5-120)
skip_duplicates  bool          true      Skip if URL or content already exists
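As an illustrative sketch (not part of the API), a client can validate the documented constraints before sending the request. The helper name below is hypothetical:

```python
def build_scrape_payload(url, timeout=30, skip_duplicates=True):
    """Build a request body for POST /api/v1/scrape (hypothetical helper).

    Enforces the documented timeout range of 5-120 seconds client-side.
    """
    if not (5 <= timeout <= 120):
        raise ValueError("timeout must be between 5 and 120 seconds")
    return {"url": url, "timeout": timeout, "skip_duplicates": skip_duplicates}
```

A payload built this way can then be sent as the JSON body shown above.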

Response:

{
  "success": true,
  "url": "https://docs.example.com/page",
  "title": "Page Title",
  "chunks_added": 15,
  "chunks_updated": 0,
  "chunks_skipped": 0,
  "error": null,
  "is_duplicate": false,
  "duplicate_type": null,
  "existing_doc_id": null,
  "existing_doc_title": null,
  "duplicate_message": null
}

Duplicate Detection Fields

Field               Type           Description
is_duplicate        bool           Whether the URL was detected as a duplicate
duplicate_type      string | null  Type of duplicate: "url" or "content_hash"
existing_doc_id     string | null  ID of the existing document if duplicate
existing_doc_title  string | null  Title of the existing document if duplicate
duplicate_message   string | null  Human-readable duplicate detection message
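A minimal sketch of how a client might interpret these fields; the helper name and summary strings are assumptions, not part of the API:

```python
def summarize_scrape_result(resp):
    """Summarize a /api/v1/scrape response dict (hypothetical helper).

    Distinguishes duplicates (detected by URL or content hash) from
    failures and fresh scrapes, using the documented response fields.
    """
    if resp.get("is_duplicate"):
        return (f"duplicate ({resp['duplicate_type']}) of "
                f"{resp['existing_doc_id']}: {resp['duplicate_message']}")
    if not resp.get("success"):
        return f"failed: {resp.get('error')}"
    return (f"added {resp['chunks_added']} chunks, "
            f"updated {resp['chunks_updated']}, "
            f"skipped {resp['chunks_skipped']}")
```

Note that `is_duplicate` is checked first: a duplicate response still has `success: true`, so checking `success` alone is not enough to tell the cases apart.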

Example:

curl -X POST http://localhost:9000/api/v1/scrape \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com/page"}'

POST /api/v1/scrape/async

Scrape multiple URLs asynchronously in the background.

Request Body:

{
  "urls": [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2",
    "https://docs.example.com/page3"
  ],
  "timeout": 30,
  "skip_duplicates": true
}
Field            Type      Default   Description
urls             string[]  required  URLs to scrape (1-50)
timeout          int       30        Request timeout per URL in seconds (5-120)
skip_duplicates  bool      true      Skip duplicate URLs/content
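A sketch of client-side validation for the batch limits above, mirroring the single-URL case; the helper name is hypothetical:

```python
def build_async_scrape_payload(urls, timeout=30, skip_duplicates=True):
    """Build a request body for POST /api/v1/scrape/async (hypothetical helper).

    Enforces the documented limits: 1-50 URLs, timeout 5-120 seconds.
    """
    urls = list(urls)
    if not (1 <= len(urls) <= 50):
        raise ValueError("urls must contain between 1 and 50 entries")
    if not (5 <= timeout <= 120):
        raise ValueError("timeout must be between 5 and 120 seconds")
    return {"urls": urls, "timeout": timeout, "skip_duplicates": skip_duplicates}
```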

Response:

{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "pending",
  "message": "Scraping 3 URL(s) in background",
  "created_at": "2025-01-15T10:30:00.000000"
}

Use the returned task_id with the Tasks endpoints to monitor progress.
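A generic polling loop for that workflow might look like the sketch below. The status-fetching function is injected so the sketch stays endpoint-agnostic; treating "completed" and "failed" as the terminal statuses is an assumption here, so check the Tasks endpoints for the actual status values.

```python
import time

def wait_for_task(task_id, fetch_status, poll_interval=1.0, max_polls=60):
    """Poll a background scrape task until it reaches a terminal state (sketch).

    fetch_status(task_id) should return the task's current status string as
    reported by the Tasks endpoints. Treating "completed" and "failed" as
    terminal states is an assumption, not documented API behavior.
    """
    for _ in range(max_polls):
        status = fetch_status(task_id)
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} not finished after {max_polls} polls")
```

For example, `fetch_status` could wrap an HTTP GET against the Tasks endpoints and return the `status` field from the JSON response.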
