Blueprint format

A blueprint is plain JSON describing how to scrape one site. Auto-detection produces a first draft; you refine it by hand. This page documents every top-level key and config block.

Top-level keys

| Key | Meaning |
| --- | --- |
| url | Base URL of the listing |
| mode | Crawl mode: omit or "html" for HTML scraping, "api" for API mode |
| search_filters | Optional list of category pages crawled with this same blueprint. See Multiple categories |
| item_selector | CSS selector matching each row in the list |
| scrape_detail | false (default) = scrape the list only; true = also visit each item's detail page |
| fields[] | Selectors read from each list row |
| detail_link_field | Which list field holds the detail-page URL (used when scrape_detail is true) |
| detail_fields[] | Selectors read from the detail page |
| pagination | strategy (none / link_list / next_link / infinite_scroll) + css |
| infinite_scroll | Block for the infinite_scroll strategy. See Infinite scroll |
| api | Block for mode: "api". See API mode |
| max_pages | Page cap (0 = all pages) |
| image_folder | Override storage path for downloaded images (default scrapes/images/{host}/) |
| get_primary_image / get_all_images / get_gallery_images / hash_names | Image options. See Images |
| http_config | Transport-level settings and polite crawling (below) |
| crawl_config | Crawl-loop throttles and item cap (below) |
| output_config | Output shaping and serialization (below) |
| dedup | Deduplication (below) |
| result_filters | Conditional item filters. See Result filters |
| cache | HTTP cache (below) |
| auto_throttle | Adaptive delay (below) |
| pipeline_names | Replace the global item pipeline for this blueprint. See Presets & item pipeline |
| item_schema | Field → type coercion map, e.g. {"price": "float", "images": "string[]"} |
| resumable | Persist dedup state between runs so re-runs only process new items (--resume) |
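
As a rough illustration of how the top-level keys fit together, the Python sketch below assembles a minimal list-only blueprint and serializes it to JSON. The URL, selectors, and field names are hypothetical placeholders, and the shape of the pagination block (a strategy plus a css selector) is inferred from the table above rather than from a confirmed schema.

```python
import json

# Hypothetical values throughout: the URL, selectors, and field names are
# placeholders, not taken from a real site.
blueprint = {
    "url": "https://example.com/listings",
    "item_selector": "div.listing-row",
    "scrape_detail": False,  # list only; no detail pages are visited
    "fields": [
        {"name": "title", "css": "h2.title"},
        {"name": "price", "css": "span.price"},
        {"name": "link", "css": "a.more", "attribute": "href"},
    ],
    # Assumed shape: a strategy name plus the selector it operates on.
    "pagination": {"strategy": "next_link", "css": "a.next"},
    "max_pages": 5,
    "item_schema": {"price": "float"},
}


def to_blueprint_json(data: dict) -> str:
    """Serialize a blueprint dict as pretty-printed JSON text."""
    return json.dumps(data, indent=2, ensure_ascii=False)


print(to_blueprint_json(blueprint))
```

The printed text can be saved as a .json file and refined by hand like any auto-detected draft.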

Field selectors

Each entry in fields[] / detail_fields[]:

| Key | Meaning |
| --- | --- |
| name | Field name in the output item |
| css | CSS selector; with type: "xpath" or "json", an XPath expression or JSON dot-path instead |
| type | "css" (default), "xpath", "json" (API mode), or "markdown" (locate with CSS, render the element's content as clean Markdown). See Markdown output |
| attribute | Attribute to read (href, src, …); omit/null to read text |
| regex | Optional regex post-processing of the extracted string |
| multiple | true returns an array of all matches (e.g. galleries) |
| label | Label-based lookup for label: value pages (see Generate) |
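
A small local lint can catch typos in field entries before a crawl. The sketch below is only an illustration built from the table above: check_field is a hypothetical helper, the rule that name is required is an assumption, and it does not replace the validation command described at the end of this page.

```python
# Keys and types as listed in the field-selector table.
ALLOWED_KEYS = {"name", "css", "type", "attribute", "regex", "multiple", "label"}
ALLOWED_TYPES = {"css", "xpath", "json", "markdown"}


def check_field(entry: dict) -> list[str]:
    """Return a list of problems found in one field-selector entry."""
    problems = []
    unknown = sorted(set(entry) - ALLOWED_KEYS)
    if unknown:
        problems.append(f"unknown keys: {', '.join(unknown)}")
    # Assumption: every entry needs a non-empty name.
    if not entry.get("name"):
        problems.append("missing name")
    # "css" is the documented default when type is omitted.
    field_type = entry.get("type", "css")
    if field_type not in ALLOWED_TYPES:
        problems.append(f"unsupported type: {field_type}")
    return problems


# Hypothetical entries showing attribute, multiple, xpath, and regex usage.
fields = [
    {"name": "link", "css": "a.more", "attribute": "href"},
    {"name": "gallery", "css": "img.thumb", "attribute": "src", "multiple": True},
    {"name": "lot", "css": "//span[@class='lot']", "type": "xpath"},
    {"name": "price", "css": "span.price", "regex": r"[\d.,]+"},
]

for entry in fields:
    print(entry["name"], check_field(entry))
```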

HTTP configuration

Controls transport-level settings and polite crawling:

```json
"http_config": {
  "timeout": 60,
  "delay_ms": 500,
  "retry_count": 3,
  "retry_delay_ms": 1000,
  "user_agent": "Mozilla/5.0 ...",
  "headers": { "Accept-Language": "pt-BR" },
  "proxies": ["http://proxy1:8080", "http://proxy2:8080"],
  "cookies": [{ "name": "session", "value": "abc123", "domain": "example.com" }],
  "transport": "flaresolverr"
}
```

  • delay_ms is the most important setting for rate-limited sites such as auction platforms; 300–500 ms usually avoids rate limiting.
  • proxies are rotated round-robin on each request. Use full http://user:pass@host:port URLs.
  • retry_count retries failed requests automatically before giving up.
  • cookies are pre-set cookies sent on every request — useful for sites that require login.
  • transport pins the HTTP transport for this blueprint (null = use the global CRAWLER_TRANSPORT), so a generated robot remembers the transport its target needs. See Transports.
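
To make the round-robin wording concrete, here is a minimal Python model of the rotation described above. It is a sketch, not the crawler's code: make_proxy_picker is a hypothetical helper, and it assumes rotation starts at the first entry and that an empty list means a direct connection.

```python
from itertools import cycle
from typing import Callable, Optional


def make_proxy_picker(proxies: list[str]) -> Callable[[], Optional[str]]:
    """Return a function that yields the next proxy on every call."""
    if not proxies:
        # Assumption: no proxies configured means connect directly.
        return lambda: None
    pool = cycle(proxies)  # endless iterator over the list, in order
    return lambda: next(pool)


pick = make_proxy_picker(["http://proxy1:8080", "http://proxy2:8080"])
picks = [pick() for _ in range(5)]  # one pick per simulated request
print(picks)
```

With two proxies, consecutive requests alternate between them, so each proxy carries roughly half of the traffic.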

Crawl configuration

Throttles the crawl loop and sets a permanent item cap:

```json
"crawl_config": {
  "delay_between_pages_ms": 1000,
  "delay_between_items_ms": 200,
  "max_items": 500
}
```

max_items is a blueprint-level hard cap (0 = no cap). The per-run --limit flag takes precedence when non-zero.
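
The precedence rule can be written as a tiny function. This is a literal reading of the sentence above, with effective_item_cap as a hypothetical name; note that under this reading a non-zero --limit wins even when it is larger than max_items.

```python
from typing import Optional


def effective_item_cap(max_items: int, limit: int = 0) -> Optional[int]:
    """Resolve the item cap for one run; None means no cap."""
    if limit != 0:
        return limit  # per-run --limit takes precedence when non-zero
    if max_items != 0:
        return max_items  # fall back to the blueprint-level cap
    return None  # both are 0: unlimited


print(effective_item_cap(500))      # blueprint cap only
print(effective_item_cap(500, 50))  # --limit overrides
```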

Output configuration

Shapes and serializes results:

```json
"output_config": {
  "format": "json",
  "flatten": false,
  "exclude_fields": ["image", "link"],
  "rename_fields": { "1_praca": "first_auction", "valor_de_avaliacao": "appraisal_value" },
  "stream": false
}
```

| format | Description |
| --- | --- |
| json | Pretty-printed JSON array (default) |
| jsonl | One JSON object per line; better for large crawls, easy to stream |
| csv | Comma-separated; first row is headers; array values are JSON-encoded in their cell |
| markdown | Single Markdown document, one section per item, for LLM / RAG ingestion. See Markdown output |

  • flatten: true collapses nested arrays: saved_images[0] → saved_images_0.
  • stream: true writes each item to disk as it is scraped (works with all formats). See Streaming output.
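
The flatten naming scheme can be illustrated with a short Python sketch. It models only the one case documented above (a list-valued field becoming indexed keys); flatten_item is a hypothetical helper, and how the real option treats nested objects or empty arrays is not covered here.

```python
def flatten_item(item: dict) -> dict:
    """Expand list-valued fields into indexed keys: key[0] -> key_0."""
    flat = {}
    for key, value in item.items():
        if isinstance(value, list):
            # One output key per element, suffixed with its index.
            for index, element in enumerate(value):
                flat[f"{key}_{index}"] = element
        else:
            flat[key] = value
    return flat


item = {"title": "Lot 1", "saved_images": ["a.jpg", "b.jpg"]}
print(flatten_item(item))
```

Flat keys of this kind map directly onto CSV columns, which is where flattening tends to matter most.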

Deduplication

Drops items whose key field has already been seen in the current run:

```json
"dedup": {
  "enabled": true,
  "key_field": "link"
}
```

Prevents duplicate records when pagination pages overlap. Combined with resumable, which persists dedup state between runs, it also skips already-seen items when a robot re-runs against a listing that hasn't changed much.
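
Key-based deduplication within a single run can be modelled with a set of seen keys. The sketch below is illustrative only: dedup_items is a hypothetical helper, and keeping items that lack the key field is an assumption, not documented behavior.

```python
from typing import Iterable, Iterator


def dedup_items(items: Iterable[dict], key_field: str) -> Iterator[dict]:
    """Yield each item once, keyed on key_field, preserving order."""
    seen = set()
    for item in items:
        key = item.get(key_field)
        if key is None:
            # Assumption: items without the key cannot be compared, so keep them.
            yield item
            continue
        if key in seen:
            continue  # already emitted an item with this key
        seen.add(key)
        yield item


# Two overlapping "pages" that both contain /lot/2.
scraped = [
    {"link": "/lot/1"},
    {"link": "/lot/2"},
    {"link": "/lot/2"},
    {"link": "/lot/3"},
]
print(list(dedup_items(scraped, "link")))
```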

HTTP cache

Caches raw page HTML to disk so re-running a robot replays from cache instead of hitting the live site. This is most useful when iterating on selectors:

```json
"cache": {
  "enabled": false,
  "path": "app/cache/http",
  "ttl_seconds": 3600
}
```

| Field | Type | Default | Meaning |
| --- | --- | --- | --- |
| enabled | bool | false | Master switch |
| path | string | "app/cache/http" | Directory under storage_path(); subdirectories per host |
| ttl_seconds | int | 3600 (1 h) | Cache lifetime; 0 = cache forever (clear manually) |

Cache stats (hits / misses) appear in the run summary.
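
The ttl_seconds rule, including the special meaning of 0, can be expressed as a one-line freshness check. This is a sketch under assumptions: is_cache_fresh is a hypothetical name, and treating an entry whose age exactly equals the TTL as expired is a guess at the boundary behavior.

```python
def is_cache_fresh(age_seconds: float, ttl_seconds: int) -> bool:
    """Decide whether a cached page of the given age may still be replayed."""
    if ttl_seconds == 0:
        return True  # 0 = cache forever, until cleared manually
    # Assumption: an entry expires once its age reaches the TTL.
    return age_seconds < ttl_seconds


print(is_cache_fresh(600, 3600))   # ten minutes old, one-hour TTL
print(is_cache_fresh(7200, 3600))  # two hours old, one-hour TTL
```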

AutoThrottle

Dynamically adjusts the inter-page delay based on actual server response latency (mimics Scrapy's AutoThrottle extension). Slow server → longer pause; fast server → shorter pause:

```json
"auto_throttle": {
  "enabled": false,
  "target_concurrency": 1.0,
  "start_delay_ms": 0,
  "max_delay_ms": 30000,
  "debug": false
}
```

| Field | Default | Meaning |
| --- | --- | --- |
| enabled | false | Master switch |
| target_concurrency | 1.0 | Expected parallel requests (keep at 1.0 for sequential crawls) |
| start_delay_ms | 0 | Initial delay before the first request |
| max_delay_ms | 30000 | Upper bound on the computed delay (safety cap) |
| debug | false | Write computed delay to STDERR after each page |

When enabled, http_config.delay_ms serves as the delay floor; AutoThrottle only raises the delay above it, up to max_delay_ms.
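
For intuition, the sketch below applies the update rule from Scrapy's AutoThrottle extension (target delay = latency / target concurrency, averaged with the previous delay, never below the target) and then clamps the result between the floor and max_delay_ms. Whether this crawler uses exactly this formula is an assumption; next_delay_ms is a hypothetical function, not part of its API.

```python
def next_delay_ms(
    current_delay_ms: float,
    latency_ms: float,
    *,
    target_concurrency: float = 1.0,
    floor_ms: float = 0.0,
    max_delay_ms: float = 30000.0,
) -> float:
    """Compute the next inter-page delay from the last response latency."""
    target = latency_ms / target_concurrency
    # Move halfway from the current delay toward the target...
    candidate = (current_delay_ms + target) / 2.0
    # ...but never drop below the target itself (react fast to slow servers).
    candidate = max(target, candidate)
    # Clamp between the floor (http_config.delay_ms) and the safety cap.
    return min(max(floor_ms, candidate), max_delay_ms)


delay = 500.0  # start at the configured floor
for latency in (2000.0, 200.0, 100.0, 100.0):
    delay = next_delay_ms(delay, latency, floor_ms=500.0)
    print(latency, delay)
```

Under this rule the delay jumps up immediately when the server slows down and decays gradually once it recovers, settling back on the floor.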

Validating

After hand-editing, run:

```bash
php artisan datahelm:scrap:validate path/to/blueprint.json
```

Released under the MIT License.