Mjara Docs
API Reference

Scrape Endpoints

URL web scraping endpoints


Base path: /api/v1/scrape

POST /api/v1/scrape

Scrape a single URL synchronously. Fetches the page content, processes it through the RAG pipeline, and stores the resulting chunks in the vector database.

Request Body:

{
  "url": "https://docs.example.com/page",
  "timeout": 30,
  "skip_duplicates": true
}
Field            Type          Default   Description
url              string (URL)  required  URL to scrape
timeout          int           30        Request timeout in seconds (5-120)
skip_duplicates  bool          true      Skip if URL or content already exists
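As an illustrative sketch (not part of the API), a client can validate the documented constraints before sending the request. The helper name below is hypothetical:

```python
def build_scrape_payload(url, timeout=30, skip_duplicates=True):
    """Build a request body for POST /api/v1/scrape (hypothetical helper).

    Enforces the documented timeout range of 5-120 seconds client-side.
    """
    if not (5 <= timeout <= 120):
        raise ValueError("timeout must be between 5 and 120 seconds")
    return {"url": url, "timeout": timeout, "skip_duplicates": skip_duplicates}
```

A payload built this way can then be sent as the JSON body shown above.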

Response:

{
  "success": true,
  "url": "https://docs.example.com/page",
  "title": "Page Title",
  "chunks_added": 15,
  "chunks_updated": 0,
  "chunks_skipped": 0,
  "error": null,
  "is_duplicate": false,
  "duplicate_type": null,
  "existing_doc_id": null,
  "existing_doc_title": null,
  "duplicate_message": null
}

Duplicate Detection Fields

Field               Type           Description
is_duplicate        bool           Whether the URL was detected as a duplicate
duplicate_type      string | null  Type of duplicate: "url" or "content_hash"
existing_doc_id     string | null  ID of the existing document if duplicate
existing_doc_title  string | null  Title of the existing document if duplicate
duplicate_message   string | null  Human-readable duplicate detection message
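A minimal sketch of how a client might interpret these fields; the helper name and summary strings are assumptions, not part of the API:

```python
def summarize_scrape_result(resp):
    """Summarize a /api/v1/scrape response dict (hypothetical helper).

    Distinguishes duplicates (detected by URL or content hash) from
    failures and fresh scrapes, using the documented response fields.
    """
    if resp.get("is_duplicate"):
        return (f"duplicate ({resp['duplicate_type']}) of "
                f"{resp['existing_doc_id']}: {resp['duplicate_message']}")
    if not resp.get("success"):
        return f"failed: {resp.get('error')}"
    return (f"added {resp['chunks_added']} chunks, "
            f"updated {resp['chunks_updated']}, "
            f"skipped {resp['chunks_skipped']}")
```

Note that `is_duplicate` is checked first: a duplicate response still has `success: true`, so checking `success` alone is not enough to tell the cases apart.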

Example:

curl -X POST http://localhost:9000/api/v1/scrape \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com/page"}'

POST /api/v1/scrape/async

Scrape multiple URLs asynchronously in the background.

Request Body:

{
  "urls": [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2",
    "https://docs.example.com/page3"
  ],
  "timeout": 30,
  "skip_duplicates": true
}
Field            Type      Default   Description
urls             string[]  required  URLs to scrape (1-50)
timeout          int       30        Request timeout per URL in seconds (5-120)
skip_duplicates  bool      true      Skip duplicate URLs/content
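A sketch of client-side validation for the batch limits above, mirroring the single-URL case; the helper name is hypothetical:

```python
def build_async_scrape_payload(urls, timeout=30, skip_duplicates=True):
    """Build a request body for POST /api/v1/scrape/async (hypothetical helper).

    Enforces the documented limits: 1-50 URLs, timeout 5-120 seconds.
    """
    urls = list(urls)
    if not (1 <= len(urls) <= 50):
        raise ValueError("urls must contain between 1 and 50 entries")
    if not (5 <= timeout <= 120):
        raise ValueError("timeout must be between 5 and 120 seconds")
    return {"urls": urls, "timeout": timeout, "skip_duplicates": skip_duplicates}
```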

Response:

{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "pending",
  "message": "Scraping 3 URL(s) in background",
  "created_at": "2025-01-15T10:30:00.000000"
}

Use the returned task_id with the Tasks endpoints to monitor progress.
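A generic polling loop for that workflow might look like the sketch below. The status-fetching function is injected so the sketch stays endpoint-agnostic; treating "completed" and "failed" as the terminal statuses is an assumption here, so check the Tasks endpoints for the actual status values.

```python
import time

def wait_for_task(task_id, fetch_status, poll_interval=1.0, max_polls=60):
    """Poll a background scrape task until it reaches a terminal state (sketch).

    fetch_status(task_id) should return the task's current status string as
    reported by the Tasks endpoints. Treating "completed" and "failed" as
    terminal states is an assumption, not documented API behavior.
    """
    for _ in range(max_polls):
        status = fetch_status(task_id)
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} not finished after {max_polls} polls")
```

For example, `fetch_status` could wrap an HTTP GET against the Tasks endpoints and return the `status` field from the JSON response.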
