Blueprint format
A blueprint is plain JSON describing how to scrape one site. Auto-detection produces a first draft; you refine it by hand. This page documents every top-level key and config block.
Top-level keys
| Key | Meaning |
|---|---|
| url | Base URL of the listing |
| mode | Crawl mode: omit/html for HTML scraping, "api" for API mode |
| search_filters | Optional list of category pages crawled with this same blueprint. See Multiple categories |
| item_selector | CSS selector matching each row in the list |
| scrape_detail | false (default) = scrape the list only; true = also visit each item's detail page |
| fields[] | Selectors read from each list row |
| detail_link_field | Which list field holds the detail-page URL (used when scrape_detail is true) |
| detail_fields[] | Selectors read from the detail page |
| pagination | strategy (none / link_list / next_link / infinite_scroll) + css |
| infinite_scroll | Block for the infinite_scroll strategy; see Infinite scroll |
| api | Block for mode: "api"; see API mode |
| max_pages | Page cap (0 = all pages) |
| image_folder | Override storage path for downloaded images (default scrapes/images/{host}/) |
| get_primary_image / get_all_images / get_gallery_images / hash_names | Image options; see Images |
| http_config | Transport-level settings and polite crawling (below) |
| crawl_config | Crawl-loop throttles and item cap (below) |
| output_config | Output shaping and serialization (below) |
| dedup | Deduplication (below) |
| result_filters | Conditional item filters; see Result filters |
| cache | HTTP cache (below) |
| auto_throttle | Adaptive delay (below) |
| pipeline_names | Replace the global item pipeline for this blueprint; see Presets & item pipeline |
| item_schema | Field → type coercion map, e.g. {"price": "float", "images": "string[]"} |
| resumable | Persist dedup state between runs so re-runs only process new items (--resume) |
Field selectors
Each entry in fields[] / detail_fields[]:
| Key | Meaning |
|---|---|
| name | Field name in the output item |
| css | CSS selector or, with type: "xpath"/"json", an XPath expression / JSON dot-path |
| type | "css" (default), "xpath", "json" (API mode), or "markdown" (locate with CSS, render the element's content as clean Markdown). See Markdown output |
| attribute | Attribute to read (href, src, …); omit/null to read text |
| regex | Optional regex post-processing of the extracted string |
| multiple | true returns an array of all matches (e.g. galleries) |
| label | Label-based lookup for label: value pages (see Generate) |
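
To show how the top-level keys and field selectors fit together, the sketch below builds a minimal blueprint as a Python dict and serializes it to JSON. Only the key names come from the tables above. The URL, the selectors, the field names, and the assumption that pagination is an object with strategy and css keys are all illustrative.

```python
import json

# Hypothetical listing: the URL and every selector below are placeholders.
blueprint = {
    "url": "https://example.com/listings",
    "item_selector": "div.listing-card",
    "scrape_detail": True,
    # Must name one of the entries in "fields" (here: "link").
    "detail_link_field": "link",
    "fields": [
        {"name": "title", "css": "h2.title"},
        {"name": "link", "css": "a.details", "attribute": "href"},
        {"name": "price", "css": "span.price"},
    ],
    "detail_fields": [
        # multiple=True asks for an array of all matches.
        {"name": "images", "css": "div.gallery img",
         "attribute": "src", "multiple": True},
    ],
    # Assumed shape: an object holding the strategy and its CSS selector.
    "pagination": {"strategy": "next_link", "css": "a.next"},
    "max_pages": 0,  # 0 = all pages
    "item_schema": {"price": "float", "images": "string[]"},
}

# Serialize to the plain-JSON form a blueprint file uses.
blueprint_json = json.dumps(blueprint, indent=2, ensure_ascii=False)
print(blueprint_json)
```

Writing blueprint_json to a file gives a starting point that can then be refined by hand.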
HTTP configuration
Controls transport-level settings and polite crawling:
"http_config": {
"timeout": 60,
"delay_ms": 500,
"retry_count": 3,
"retry_delay_ms": 1000,
"user_agent": "Mozilla/5.0 ...",
"headers": { "Accept-Language": "pt-BR" },
"proxies": ["http://proxy1:8080", "http://proxy2:8080"],
"cookies": [{ "name": "session", "value": "abc123", "domain": "example.com" }],
"transport": "flaresolverr"
}
- delay_ms is the most important setting for sites like auction platforms: 300–500 ms avoids rate limiting.
- proxies are rotated round-robin on each request. Use full http://user:pass@host:port URLs.
- retry_count retries failed requests automatically before giving up.
- cookies are pre-set cookies sent on every request, useful for sites that require login.
- transport pins the HTTP transport for this blueprint (null = use the global CRAWLER_TRANSPORT), so a generated robot remembers the transport its target needs. See Transports.
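
Round-robin rotation means each request takes the next proxy in the list and wraps back to the first after the last. The sketch below illustrates that behaviour in Python; next_proxy is a hypothetical helper, not part of the tool.

```python
from itertools import cycle

proxies = ["http://proxy1:8080", "http://proxy2:8080"]

# cycle() yields the list elements in order, forever, wrapping around.
proxy_pool = cycle(proxies)


def next_proxy() -> str:
    """Return the proxy to use for the next request (hypothetical helper)."""
    return next(proxy_pool)


# Five consecutive requests alternate between the two proxies.
used = [next_proxy() for _ in range(5)]
print(used)
```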
Crawl configuration
Throttles the crawl loop and sets a permanent item cap:
"crawl_config": {
"delay_between_pages_ms": 1000,
"delay_between_items_ms": 200,
"max_items": 500
}
max_items is a blueprint-level hard cap (0 = no cap). The per-run --limit flag takes precedence when non-zero.
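
The precedence rule can be written as a one-line decision. The sketch below is an illustration only: effective_item_cap is a hypothetical name, and a return value of 0 is read as "no cap", matching the convention above.

```python
def effective_item_cap(max_items: int, cli_limit: int = 0) -> int:
    """Return the item cap for one run; 0 means no cap.

    max_items is the blueprint's crawl_config.max_items,
    cli_limit is the value passed via --limit (0 when the flag is absent).
    """
    # A non-zero --limit wins; otherwise fall back to the blueprint cap.
    return cli_limit if cli_limit > 0 else max_items


print(effective_item_cap(500))        # blueprint cap applies
print(effective_item_cap(500, 50))    # --limit overrides it
```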
Output configuration
Shapes and serializes results:
"output_config": {
"format": "json",
"flatten": false,
"exclude_fields": ["image", "link"],
"rename_fields": { "1_praca": "first_auction", "valor_de_avaliacao": "appraisal_value" },
"stream": false
}
| format | Description |
|---|---|
| json | Pretty-printed JSON array (default) |
| jsonl | One JSON object per line; better for large crawls, easy to stream |
| csv | Comma-separated; first row is headers; array values are JSON-encoded in their cell |
| markdown | Single Markdown document, one section per item, for LLM / RAG ingestion. See Markdown output |
- flatten: true collapses nested arrays: saved_images[0] → saved_images_0.
- stream: true writes each item to disk as it is scraped (works with all formats); see Streaming output.
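
The following Python sketch shows what exclude_fields, rename_fields, and flatten do to a single item. shape_item is a hypothetical function, the sample item is made up, and the order of operations (exclude, then rename, then flatten) is an assumption rather than documented behaviour.

```python
def shape_item(item: dict, config: dict) -> dict:
    """Apply exclude_fields, rename_fields and flatten to one item."""
    excluded = set(config.get("exclude_fields", []))
    renames = config.get("rename_fields", {})
    shaped = {}
    for key, value in item.items():
        if key in excluded:
            continue  # dropped from the output entirely
        key = renames.get(key, key)  # rename if a mapping exists
        if config.get("flatten") and isinstance(value, list):
            # saved_images[0] -> saved_images_0, saved_images[1] -> ...
            for index, element in enumerate(value):
                shaped[f"{key}_{index}"] = element
        else:
            shaped[key] = value
    return shaped


output_config = {
    "flatten": True,
    "exclude_fields": ["image", "link"],
    "rename_fields": {"1_praca": "first_auction"},
}
item = {
    "title": "Lot 12",
    "link": "/lot/12",
    "1_praca": "2024-05-01",
    "saved_images": ["a.jpg", "b.jpg"],
}
shaped = shape_item(item, output_config)
print(shaped)
```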
Deduplication
Drops items whose key field has already been seen in the current run:
"dedup": {
"enabled": true,
"key_field": "link"
}
Prevents duplicate records when pagination pages overlap. Combined with resumable, which persists dedup state between runs, it also skips already-seen items when a robot re-runs against a listing that hasn't changed much.
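
Within a single run, deduplication amounts to remembering the key values already emitted. A minimal sketch, assuming items are plain dicts; dedup_items is a hypothetical name, and keeping items that lack the key field is an assumption.

```python
def dedup_items(items, key_field):
    """Yield each item once, keyed on item[key_field]."""
    seen = set()
    for item in items:
        key = item.get(key_field)
        if key is None:
            # Assumption: items without the key cannot be compared, so keep them.
            yield item
            continue
        if key in seen:
            continue  # already emitted in this run
        seen.add(key)
        yield item


# Two overlapping pagination pages: /lot/2 appears on both.
page_1 = [{"link": "/lot/1"}, {"link": "/lot/2"}]
page_2 = [{"link": "/lot/2"}, {"link": "/lot/3"}]
unique = list(dedup_items(page_1 + page_2, "link"))
print(unique)
```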
HTTP cache
Caches raw page HTML to disk so re-running a robot replays from cache instead of hitting the live site. Perfect when iterating on selectors:
"cache": {
"enabled": false,
"path": "app/cache/http",
"ttl_seconds": 3600
}
| Field | Type | Default | Meaning |
|---|---|---|---|
| enabled | bool | false | Master switch |
| path | string | "app/cache/http" | Directory under storage_path(); subdirectories per host |
| ttl_seconds | int | 3600 (1 h) | Cache lifetime; 0 = cache forever (clear manually) |
Cache stats (hits / misses) appear in the run summary.
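
The ttl_seconds rule, including the special meaning of 0, can be sketched as a freshness check. is_cache_fresh is a hypothetical helper, and treating an entry as expired at exactly ttl_seconds is an assumption.

```python
def is_cache_fresh(stored_at: float, now: float, ttl_seconds: int) -> bool:
    """Decide whether a cached page may be replayed.

    stored_at and now are Unix timestamps in seconds.
    """
    if ttl_seconds == 0:
        return True  # 0 = cache forever
    return (now - stored_at) < ttl_seconds


stored = 1_000_000.0
print(is_cache_fresh(stored, stored + 600, 3600))   # within the hour: hit
print(is_cache_fresh(stored, stored + 7200, 3600))  # two hours later: miss
```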
AutoThrottle
Dynamically adjusts the inter-page delay based on actual server response latency (mimics Scrapy's AutoThrottle extension). Slow server → longer pause; fast server → shorter pause:
"auto_throttle": {
"enabled": false,
"target_concurrency": 1.0,
"start_delay_ms": 0,
"max_delay_ms": 30000,
"debug": false
}
| Field | Default | Meaning |
|---|---|---|
| enabled | false | Master switch |
| target_concurrency | 1.0 | Expected parallel requests (keep at 1.0 for sequential crawls) |
| start_delay_ms | 0 | Initial delay before the first request |
| max_delay_ms | 30000 | Upper bound on the computed delay (safety cap) |
| debug | false | Write computed delay to STDERR after each page |
When enabled, http_config.delay_ms acts as the delay floor; AutoThrottle only raises the delay above it, never below.
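
For intuition, the sketch below applies the update rule that Scrapy's AutoThrottle documentation describes: the target delay is latency divided by target_concurrency, the next delay is the average of the previous delay and the target but never below the target, and the result is clamped between the floor and max_delay_ms. Whether this tool uses exactly the same formula is an assumption; next_delay_ms is a hypothetical name.

```python
def next_delay_ms(current_delay_ms: float, latency_ms: float,
                  target_concurrency: float = 1.0,
                  min_delay_ms: float = 500,
                  max_delay_ms: float = 30000) -> float:
    """Compute the next inter-page delay from the last response latency.

    min_delay_ms stands in for http_config.delay_ms (the floor).
    """
    target = latency_ms / target_concurrency
    # Move halfway towards the target, but never go below the target itself.
    candidate = max(target, (current_delay_ms + target) / 2.0)
    # Clamp to [floor, safety cap].
    return min(max(min_delay_ms, candidate), max_delay_ms)


print(next_delay_ms(500, 2000))    # slow server: delay rises to the target
print(next_delay_ms(2000, 100))    # fast server: delay eases down gradually
print(next_delay_ms(500, 100))     # never drops below the floor
```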
Validating
After hand-editing, run:
php artisan datahelm:scrap:validate path/to/blueprint.json

