Blueprint format

A blueprint is plain JSON describing how to scrape one site. Auto-detection produces a first draft; you refine it by hand. This page documents every top-level key and config block.

Top-level keys

Key	Meaning
`url`	Base URL of the listing
`mode`	Crawl mode — omit/`html` for HTML scraping, `"api"` for API mode
`search_filters`	Optional list of category pages crawled with this same blueprint. See Multiple categories
`item_selector`	CSS selector matching each row in the list
`scrape_detail`	`false` (default) = scrape the list only; `true` = also visit each item's detail page
`fields[]`	Selectors read from each list row
`detail_link_field`	Which list field holds the detail-page URL (used when `scrape_detail` is `true`)
`detail_fields[]`	Selectors read from the detail page
`pagination`	`strategy` (`none` / `link_list` / `next_link` / `infinite_scroll`) + `css`
`infinite_scroll`	Block for the `infinite_scroll` strategy — see Infinite scroll
`api`	Block for `mode: "api"` — see API mode
`max_pages`	Page cap (`0` = all pages)
`image_folder`	Override storage path for downloaded images (default `scrapes/images/{host}/`)
`get_primary_image` / `get_all_images` / `get_gallery_images` / `hash_names`	Image options — see Images
`http_config`	Transport-level settings and polite crawling (below)
`crawl_config`	Crawl-loop throttles and item cap (below)
`output_config`	Output shaping and serialization (below)
`dedup`	Deduplication (below)
`result_filters`	Conditional item filters — see Result filters
`cache`	HTTP cache (below)
`auto_throttle`	Adaptive delay (below)
`pipeline_names`	Replace the global item pipeline for this blueprint — see Presets & item pipeline
`item_schema`	Field → type coercion map, e.g. `{"price": "float", "images": "string[]"}`
`resumable`	Persist dedup state between runs so re-runs only process new items (`--resume`)
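
To show how these keys fit together, here is a minimal Python sketch that assembles a list-plus-detail blueprint as a dict and serializes it to JSON. The URL, selectors, and field names are hypothetical placeholders, and the `detail_link_is_declared` helper is an illustrative sanity check, not part of the tool's validator.

```python
import json

# Hypothetical blueprint for an imaginary listing site. Every URL, selector,
# and field name below is a placeholder chosen for illustration.
blueprint = {
    "url": "https://example.com/listings",
    "item_selector": "div.listing-card",
    "scrape_detail": True,
    "fields": [
        {"name": "title", "css": "h2.title"},
        {"name": "link", "css": "a.card-link", "attribute": "href"},
    ],
    "detail_link_field": "link",
    "detail_fields": [
        {"name": "description", "css": "div.description"},
    ],
    "pagination": {"strategy": "next_link", "css": "a.next"},
    "max_pages": 0,  # 0 = all pages
}


def detail_link_is_declared(bp: dict) -> bool:
    """Return True if detail_link_field names one of the list fields.

    Only relevant when scrape_detail is true; otherwise there is nothing
    to check.
    """
    if not bp.get("scrape_detail"):
        return True
    list_field_names = {field["name"] for field in bp.get("fields", [])}
    return bp.get("detail_link_field") in list_field_names


# A blueprint is plain JSON, so serializing the dict yields a usable file body.
blueprint_json = json.dumps(blueprint, indent=2)
```

Because `mode` is omitted, this draft would use HTML scraping rather than API mode.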

Field selectors

Each entry in fields[] / detail_fields[]:

Key	Meaning
`name`	Field name in the output item
`css`	CSS selector — or, with `type: "xpath"`/`"json"`, an XPath expression / JSON dot-path
`type`	`"css"` (default), `"xpath"`, `"json"` (API mode), or `"markdown"` (locate with CSS, then render the element's content as clean Markdown). See Markdown output
`attribute`	Attribute to read (`href`, `src`, …); omit/`null` to read text
`regex`	Optional regex post-processing of the extracted string
`multiple`	`true` returns an array of all matches (e.g. galleries)
`label`	Label-based lookup for `label: value` pages (see Generate)
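
For `type: "json"` fields, the selector is a JSON dot-path. The Python sketch below shows one plausible way such a path could be resolved against an API response. The function name `resolve_dot_path` is hypothetical, and treating numeric segments as list indices is an assumption; the actual path syntax may differ.

```python
from typing import Any


def resolve_dot_path(data: Any, path: str, default: Any = None) -> Any:
    """Walk nested dicts and lists following a dot-separated path.

    Assumption: a purely numeric segment indexes into a list.
    Returns `default` as soon as any segment cannot be resolved.
    """
    current = data
    for part in path.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif isinstance(current, list) and part.isdigit() and int(part) < len(current):
            current = current[int(part)]
        else:
            return default
    return current


# Hypothetical API payload used only for illustration.
payload = {"seller": {"name": "Ana"}, "images": [{"src": "a.jpg"}]}
seller_name = resolve_dot_path(payload, "seller.name")
first_image = resolve_dot_path(payload, "images.0.src")
```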

HTTP configuration

Controls transport-level settings and polite crawling:

```json
"http_config": {
  "timeout": 60,
  "delay_ms": 500,
  "retry_count": 3,
  "retry_delay_ms": 1000,
  "user_agent": "Mozilla/5.0 ...",
  "headers": { "Accept-Language": "pt-BR" },
  "proxies": ["http://proxy1:8080", "http://proxy2:8080"],
  "cookies": [{ "name": "session", "value": "abc123", "domain": "example.com" }],
  "transport": "flaresolverr"
}
```

delay_ms matters most on rate-limited sites such as auction platforms; 300–500 ms is usually enough to avoid being rate limited.
proxies are rotated round-robin on each request. Use full http://user:pass@host:port URLs.
retry_count is the number of times a failed request is retried before the crawler gives up.
cookies are pre-set and sent on every request, which is useful for sites that require login.
transport pins the HTTP transport for this blueprint (null = use the global CRAWLER_TRANSPORT), so a generated robot remembers the transport its target needs. See Transports.
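
The round-robin proxy rotation and retry behavior described above can be sketched in Python as follows. This illustrates the general technique and is not the crawler's actual code: `make_proxy_picker`, `fetch_with_retries`, and the `fetch` callback are hypothetical names, and reading `retry_count` as "one initial attempt plus that many retries, with `retry_delay_ms` between attempts" is an assumption.

```python
import time
from itertools import cycle
from typing import Callable, Optional, Sequence


def make_proxy_picker(proxies: Sequence[str]) -> Callable[[], Optional[str]]:
    """Return a function that hands out proxies in round-robin order.

    With an empty proxy list the picker always returns None (direct connection).
    """
    pool = cycle(proxies) if proxies else None

    def next_proxy() -> Optional[str]:
        return next(pool) if pool is not None else None

    return next_proxy


def fetch_with_retries(fetch, next_proxy, retry_count=3, retry_delay_ms=1000,
                       sleep=time.sleep):
    """Call fetch(proxy) until it succeeds or the retries are exhausted.

    Assumption: retry_count counts retries after the first attempt, so the
    request is tried at most 1 + retry_count times. Each attempt takes the
    next proxy from the rotation.
    """
    last_error = None
    for attempt in range(retry_count + 1):
        try:
            return fetch(next_proxy())
        except Exception as exc:  # a real crawler would catch narrower errors
            last_error = exc
            if attempt < retry_count:
                sleep(retry_delay_ms / 1000)
    raise last_error
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real pauses.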

Crawl configuration

Throttles the crawl loop and sets a permanent item cap:

```json
"crawl_config": {
  "delay_between_pages_ms": 1000,
  "delay_between_items_ms": 200,
  "max_items": 500
}
```

max_items is a blueprint-level hard cap (0 = no cap). The per-run --limit flag takes precedence when non-zero.
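
The precedence rule can be written as a small Python function. This is a sketch of the rule as stated, with a hypothetical function name; it takes the sentence literally, so a non-zero `--limit` wins even when it is larger than `max_items`, which is an assumption about the real behavior.

```python
def effective_item_cap(max_items: int, limit: int = 0) -> int:
    """Return the item cap for one run; 0 means no cap.

    max_items is the blueprint-level cap; limit is the per-run --limit value.
    A non-zero limit takes precedence over max_items.
    """
    return limit if limit != 0 else max_items


cap_from_blueprint = effective_item_cap(max_items=500)       # no --limit given
cap_from_flag = effective_item_cap(max_items=500, limit=50)  # --limit 50
```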

Output configuration

Shapes and serializes results:

```json
"output_config": {
  "format": "json",
  "flatten": false,
  "exclude_fields": ["image", "link"],
  "rename_fields": { "1_praca": "first_auction", "valor_de_avaliacao": "appraisal_value" },
  "stream": false
}
```

`format`	Description
`json`	Pretty-printed JSON array (default)
`jsonl`	One JSON object per line — better for large crawls, easy to stream
`csv`	Comma-separated; first row is headers; array values are JSON-encoded in their cell
`markdown`	Single Markdown document, one section per item — for LLM / RAG ingestion. See Markdown output

flatten: true collapses nested arrays: saved_images[0] → saved_images_0.
stream: true writes each item to disk as it is scraped (works with all formats) — see Streaming output.
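
To make the shaping options concrete, the following Python sketch applies `exclude_fields`, `rename_fields`, and `flatten` to a single item. The function name `shape_item` is hypothetical, and the order of operations (exclude, then rename, then flatten) is an assumption rather than documented behavior.

```python
def shape_item(item, exclude_fields=(), rename_fields=None, flatten=False):
    """Return a shaped copy of one scraped item.

    - Keys listed in exclude_fields are dropped.
    - Keys found in rename_fields are renamed.
    - With flatten=True, list values expand into indexed keys,
      e.g. saved_images[0] becomes saved_images_0.
    """
    rename_fields = rename_fields or {}
    shaped = {}
    for key, value in item.items():
        if key in exclude_fields:
            continue
        key = rename_fields.get(key, key)
        if flatten and isinstance(value, list):
            for index, element in enumerate(value):
                shaped[f"{key}_{index}"] = element
        else:
            shaped[key] = value
    return shaped


# Hypothetical scraped item used only for illustration.
item = {
    "title": "Lot 1",
    "image": "thumb.jpg",
    "1_praca": "1000",
    "saved_images": ["a.jpg", "b.jpg"],
}
shaped = shape_item(
    item,
    exclude_fields=["image", "link"],
    rename_fields={"1_praca": "first_auction"},
    flatten=True,
)
```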

Deduplication

Drops items whose key field has already been seen in the current run:

```json
"dedup": {
  "enabled": true,
  "key_field": "link"
}
```

This prevents duplicate records when pagination pages overlap. With resumable enabled, the dedup state also persists between runs, so a robot re-running against a listing that has barely changed only processes new items.
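
Key-based deduplication amounts to a seen-set filter. The Python sketch below illustrates the idea under two assumptions: the name `dedup_items` is hypothetical, and items that lack the key field are kept rather than dropped.

```python
def dedup_items(items, key_field="link", seen=None):
    """Yield items whose key_field value has not been seen before.

    Pass an existing `seen` set to model dedup state carried over from an
    earlier run; by default each call starts with an empty set.
    """
    seen = set() if seen is None else seen
    for item in items:
        key = item.get(key_field)
        if key is None:
            # Assumption: items without the key field pass through untouched.
            yield item
            continue
        if key in seen:
            continue
        seen.add(key)
        yield item


# Two overlapping pagination pages: "/lot/2" appears on both.
page_1 = [{"link": "/lot/1"}, {"link": "/lot/2"}]
page_2 = [{"link": "/lot/2"}, {"link": "/lot/3"}]
unique = list(dedup_items(page_1 + page_2))
```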

HTTP cache

Caches raw page HTML to disk so re-running a robot replays from cache instead of hitting the live site. Perfect when iterating on selectors:

```json
"cache": {
  "enabled": false,
  "path": "app/cache/http",
  "ttl_seconds": 3600
}
```

Field	Type	Default	Meaning
`enabled`	bool	`false`	Master switch
`path`	string	`"app/cache/http"`	Directory under `storage_path()`; subdirectories per host
`ttl_seconds`	int	`3600` (1 h)	Cache lifetime; `0` = cache forever (clear manually)

Cache stats (hits / misses) appear in the run summary.
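
The `ttl_seconds` rule, including the special meaning of `0`, can be expressed as a freshness check. This Python sketch uses a hypothetical function name, and treating an entry as stale exactly at the TTL boundary is an assumption.

```python
import time


def is_cache_entry_fresh(stored_at: float, ttl_seconds: int, now=None) -> bool:
    """Return True if a cached page may still be replayed.

    stored_at and now are Unix timestamps in seconds. A ttl_seconds of 0
    means the entry never expires and must be cleared manually.
    """
    if ttl_seconds == 0:
        return True
    now = time.time() if now is None else now
    return (now - stored_at) < ttl_seconds


# An entry written at t=0 with the default one-hour TTL.
fresh_after_59_min = is_cache_entry_fresh(stored_at=0, ttl_seconds=3600, now=59 * 60)
fresh_after_61_min = is_cache_entry_fresh(stored_at=0, ttl_seconds=3600, now=61 * 60)
```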

AutoThrottle

Dynamically adjusts the inter-page delay based on actual server response latency (mimics Scrapy's AutoThrottle extension). Slow server → longer pause; fast server → shorter pause:

```json
"auto_throttle": {
  "enabled": false,
  "target_concurrency": 1.0,
  "start_delay_ms": 0,
  "max_delay_ms": 30000,
  "debug": false
}
```

Field	Default	Meaning
`enabled`	`false`	Master switch
`target_concurrency`	`1.0`	Expected parallel requests (keep at `1.0` for sequential crawls)
`start_delay_ms`	`0`	Initial delay before the first request
`max_delay_ms`	`30000`	Upper bound on the computed delay (safety cap)
`debug`	`false`	Write computed delay to STDERR after each page

When AutoThrottle is enabled, http_config.delay_ms acts as the delay floor: the computed delay rises above it when the server is slow and never drops below it.
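
Since the feature is described as mimicking Scrapy's AutoThrottle, the Python sketch below follows Scrapy's documented update rule: the target delay is latency divided by target concurrency, the new delay is the average of the previous delay and the target, it is never set below the target, and it is clamped to the configured bounds. Whether this tool uses exactly the same formula is an assumption, and the function name is hypothetical.

```python
def next_delay_ms(current_delay_ms, latency_ms, target_concurrency=1.0,
                  min_delay_ms=500, max_delay_ms=30000):
    """Compute the next inter-page delay from the last response latency.

    min_delay_ms stands in for http_config.delay_ms (the floor) and
    max_delay_ms for auto_throttle.max_delay_ms (the safety cap).
    """
    target = latency_ms / target_concurrency
    # Move halfway from the current delay toward the target...
    new_delay = (current_delay_ms + target) / 2
    # ...but never settle below the target itself (as Scrapy does).
    new_delay = max(target, new_delay)
    # Clamp to the configured floor and cap.
    return min(max(new_delay, min_delay_ms), max_delay_ms)


slow_server = next_delay_ms(current_delay_ms=500, latency_ms=2000)
recovering = next_delay_ms(current_delay_ms=2000, latency_ms=100)
fast_server = next_delay_ms(current_delay_ms=500, latency_ms=100)
```

With these inputs a slow response raises the delay immediately, while a fast response lowers it gradually and only down to the floor.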

Validating

After hand-editing, run:

```bash
php artisan datahelm:scrap:validate path/to/blueprint.json
```
