# API Reference: Scrape Endpoints

URL web scraping endpoints.

Base path: `/api/v1/scrape`
## POST /api/v1/scrape

Scrape a single URL synchronously. Fetches the content, processes it through the RAG pipeline, and stores the resulting chunks in the vector database.
**Request Body:**

```json
{
  "url": "https://docs.example.com/page",
  "timeout": 30,
  "skip_duplicates": true
}
```

| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string (URL) | required | URL to scrape |
| `timeout` | int | 30 | Request timeout in seconds (5-120) |
| `skip_duplicates` | bool | true | Skip if the URL or content already exists |
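The constraints in the table above can be checked client-side before issuing a request. A minimal sketch; the helper itself is hypothetical, but the field names and the 5-120 second timeout range come from the table:

```python
def build_scrape_request(url, timeout=30, skip_duplicates=True):
    """Build and validate a payload for POST /api/v1/scrape.

    Hypothetical client-side helper; field names and the timeout
    range mirror the request table above.
    """
    if not url:
        raise ValueError("url is required")
    if not 5 <= timeout <= 120:
        raise ValueError("timeout must be between 5 and 120 seconds")
    return {"url": url, "timeout": timeout, "skip_duplicates": skip_duplicates}
```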
**Response:**

```json
{
  "success": true,
  "url": "https://docs.example.com/page",
  "title": "Page Title",
  "chunks_added": 15,
  "chunks_updated": 0,
  "chunks_skipped": 0,
  "error": null,
  "is_duplicate": false,
  "duplicate_type": null,
  "existing_doc_id": null,
  "existing_doc_title": null,
  "duplicate_message": null
}
```

**Duplicate Detection Fields:**

| Field | Type | Description |
|---|---|---|
| `is_duplicate` | bool | Whether the URL was detected as a duplicate |
| `duplicate_type` | string \| null | Type of duplicate: `"url"` or `"content_hash"` |
| `existing_doc_id` | string \| null | ID of the existing document, if a duplicate |
| `existing_doc_title` | string \| null | Title of the existing document, if a duplicate |
| `duplicate_message` | string \| null | Human-readable duplicate detection message |
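A client typically branches on these fields to decide whether anything new was ingested. A minimal sketch over a decoded response dict; the summarizing helper is hypothetical, but the field names follow the response schema above:

```python
def summarize_scrape(resp: dict) -> str:
    """Summarize a /api/v1/scrape response dict.

    Hypothetical helper; field names follow the response schema above.
    """
    if not resp.get("success"):
        return f"failed: {resp.get('error')}"
    if resp.get("is_duplicate"):
        # duplicate_type is "url" or "content_hash"
        return f"duplicate ({resp['duplicate_type']}) of {resp.get('existing_doc_id')}"
    return f"added {resp.get('chunks_added', 0)} chunk(s)"
```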
**Example:**

```bash
curl -X POST http://localhost:9000/api/v1/scrape \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com/page"}'
```

## POST /api/v1/scrape/async
Scrape multiple URLs asynchronously in the background.
**Request Body:**

```json
{
  "urls": [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2",
    "https://docs.example.com/page3"
  ],
  "timeout": 30,
  "skip_duplicates": true
}
```

| Field | Type | Default | Description |
|---|---|---|---|
| `urls` | string[] | required | URLs to scrape (1-50) |
| `timeout` | int | 30 | Request timeout per URL in seconds (5-120) |
| `skip_duplicates` | bool | true | Skip duplicate URLs/content |
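The batch endpoint adds one constraint beyond the single-URL endpoint: the `urls` array must hold 1-50 entries. A minimal client-side check, as a hypothetical helper mirroring the table above:

```python
def build_async_scrape_request(urls, timeout=30, skip_duplicates=True):
    """Build and validate a payload for POST /api/v1/scrape/async.

    Hypothetical helper; the 1-50 URL limit and field names come
    from the request table above.
    """
    if not 1 <= len(urls) <= 50:
        raise ValueError("urls must contain between 1 and 50 entries")
    return {"urls": list(urls), "timeout": timeout,
            "skip_duplicates": skip_duplicates}
```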
**Response:**

```json
{
  "task_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "pending",
  "message": "Scraping 3 URL(s) in background",
  "created_at": "2025-01-15T10:30:00.000000"
}
```

Use the returned `task_id` with the Tasks endpoints to monitor progress.
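Monitoring then reduces to polling until the task leaves its in-flight states. A minimal sketch with the status lookup injected as a callable, since the exact Tasks endpoint URL is documented elsewhere; the terminal status names here are assumptions, not part of this reference:

```python
import time

def wait_for_task(task_id, fetch_status, poll_interval=1.0, max_polls=60):
    """Poll a task until it reaches a terminal state.

    `fetch_status` is any callable mapping a task_id to a status
    string (e.g. an HTTP GET against the Tasks endpoints). Treating
    "pending" and "running" as the only in-flight states is an
    assumption.
    """
    for _ in range(max_polls):
        status = fetch_status(task_id)
        if status not in ("pending", "running"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} not finished after {max_polls} polls")
```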