Generate a blueprint

Step 1 of the two-step workflow. You point the generator at a URL; it fetches the page and auto-detects the item list, pagination, and field selectors (link, title, image, price, rating, address, description), printing a blueprint as JSON.

```bash
php artisan datahelm:scrap:generate <url> [--get-detail=true] [--max-pages=0] [--save]
```

Field detectors are configurable in config/crawler.php (detectors) — add your own to teach the generator domain-specific fields.

TIP

In the Docker stack, replace php with docker compose run --rm: docker compose run --rm artisan datahelm:scrap:generate <url> …

Worked example

```bash
php artisan datahelm:scrap:generate \
  https://www.exampleauctions.com/real-estate/apartments \
  --get-detail=true --get-primary-image=true --hash-names=true --robot --robot-name=ExampleAuctions
```

The blueprint is plain data — open the JSON and fix any selector the heuristics got wrong before running it.
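Because the blueprint is plain JSON, a selector fix can also be scripted rather than typed by hand. The following is a minimal sketch in Python, not part of the tool: the blueprint excerpt, the selectors, and the `set_selector` helper are all hypothetical, and the field layout (`name` / `css` / `attribute`) is assumed from the field example in the XPath selectors section.

```python
import json

# Hypothetical blueprint excerpt. The selectors are invented for illustration;
# the field layout (name / css / attribute) is assumed from the XPath example.
blueprint_json = """
{
  "item_selector": "div.card",
  "fields": [
    {"name": "title", "css": "h2 a", "attribute": null},
    {"name": "price", "css": "span.old-price", "attribute": null}
  ]
}
"""


def set_selector(blueprint, field_name, new_css):
    """Replace the css selector of the named list field; return True if found."""
    for field in blueprint.get("fields", []):
        if field.get("name") == field_name:
            field["css"] = new_css
            return True
    return False


blueprint = json.loads(blueprint_json)
fixed = set_selector(blueprint, "price", "span.price")

# Serialize again; ensure_ascii=False keeps accented labels readable.
patched_json = json.dumps(blueprint, indent=2, ensure_ascii=False)
```

Returning a boolean makes a misspelled field name visible instead of silently leaving the blueprint unchanged.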

Options

Option	Effect
`--get-detail=true`	Also visit a sample item to detect detail-page fields (incl. image gallery)
`--max-pages=`	Page cap stored in the blueprint (default `0` = all pages)
`--get-all-images=true`	Put every image URL into `all_images` in the JSON (default `false`)
`--get-primary-image=true`	Put only the primary (most relevant) image URL per item into `primary_image` (default `false`)
`--get-gallery-images=true`	Detect the detail-page gallery and put its image URLs into `gallery_images`; implies `--get-detail` (default `false`)
`--hash-names=true`	Rename stored images to a unique content hash when you download them (default `false`)
`--http-delay=`	Milliseconds to wait between page requests (default `0`)
`--preset=`	Detection heuristics profile: `generic` (default), `ecommerce`, `auctions`, `properties`. See Presets
`--http-timeout=`	Request timeout in seconds (default `60`)
`--http-retries=`	Retry count on transient failures (default `3`)
`--output-format=`	Output format: `json`, `jsonl`, `csv`, `markdown` (default `json`). See Markdown output
`--dedup`	Enable deduplication (drop items with a repeated key-field value)
`--dedup-key=`	Field used as the dedup uniqueness key (default `link`)
`--page-delay=`	Milliseconds to wait after each pagination page (default `0`)
`--item-delay=`	Milliseconds to wait after each item (default `0`)
`--max-items=`	Blueprint-level item cap (`0` = unlimited)
`--api-endpoint=`	Force API mode: the JSON endpoint a JS site calls (skips HTML detection)
`--api-method=`	HTTP method for the API endpoint: `GET` (default) or `POST`
`--api-items-path=`	Dot-path to the items array in the JSON (e.g. `data.results.content`)
`--search-filters=`	JSON array of category pages to crawl with the same blueprint. See Multiple categories
`--transport=`	HTTP transport, also baked into the blueprint: `auto`, `guzzle`, `browser`, `flaresolverr`, `scraping_api`. See Transports
`--render-js`	Shortcut to set `render_js=true` (use a headless browser; needs the `browser`/`flaresolverr` transport)
`--header="K: V"`	Extra request header (repeatable), baked into `http_config.headers`. Use it to replay captured browser headers
`--cookie="a=1; b=2"`	Cookies (e.g. captured `_px3`/`_abck`/session) baked into `http_config.cookies` — the free way past hard WAFs for a one-off run
`--json`	Print the raw blueprint JSON to stdout instead of saving it
`--save`	Save the blueprint under the host name for later `datahelm:scrap:run <host>`
`--robot`	Scaffold a `Robot{Name}` command under `app/Console/Commands/RobotsCommand`. See Robots
`--robot-name=`	Class base name for `--robot` (default: derived from the host)
`--force`	Overwrite an existing robot command file
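Two of the options above, `--api-items-path` and `--dedup` / `--dedup-key`, describe behaviour that is easy to picture in code. The Python sketch below illustrates the two ideas only; it is not the tool's implementation. The function names and sample payload are hypothetical, and keeping the first occurrence of a repeated key (and keeping items that lack the key) are assumptions the table does not confirm.

```python
def resolve_items_path(payload, dot_path):
    """Walk a parsed JSON payload along a dot-path such as 'data.results.content'.

    Only object keys are handled here; list indexes are out of scope.
    """
    node = payload
    for key in dot_path.split("."):
        node = node[key]  # raises KeyError if a segment is missing
    return node


def dedup(items, key="link"):
    """Drop items whose `key` value was already seen (keep-first is an assumption)."""
    seen = set()
    unique = []
    for item in items:
        if key not in item:
            unique.append(item)  # assumption: items without the key are kept
            continue
        if item[key] in seen:
            continue
        seen.add(item[key])
        unique.append(item)
    return unique


# Hypothetical API response shaped to match the dot-path from the table.
response = {"data": {"results": {"content": [
    {"link": "/lot/1", "title": "Apartment A"},
    {"link": "/lot/2", "title": "Apartment B"},
    {"link": "/lot/1", "title": "Apartment A (repeat)"},
]}}}

items = resolve_items_path(response, "data.results.content")
unique_items = dedup(items, key="link")
```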

What detection produces

The generator suggests fields generously — as many as it can find in both the list and the detail page — so you keep what you want and delete the rest. A field that appears in both sections uses the same name in each: keep it under fields or detail_fields, whichever you prefer, and remove the other.
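Pruning fields that were detected in both sections can be scripted. The Python sketch below assumes each field is an object with a `name` key; the helper `drop_duplicate_fields` and the sample blueprint are hypothetical. It also refuses to remove the list field named by `detail_link_field`, on the assumption that the run still needs that field to reach the detail page.

```python
def drop_duplicate_fields(blueprint, keep="detail_fields"):
    """Remove fields present in both sections from the section that is not `keep`.

    Returns the names that were removed.
    """
    if keep not in ("fields", "detail_fields"):
        raise ValueError("keep must be 'fields' or 'detail_fields'")
    other = "fields" if keep == "detail_fields" else "detail_fields"

    kept_names = {f["name"] for f in blueprint.get(keep, [])}
    # Assumption: the link field must stay in the list section.
    protected = {blueprint.get("detail_link_field")} if other == "fields" else set()

    removed, remaining = [], []
    for field in blueprint.get(other, []):
        if field["name"] in kept_names and field["name"] not in protected:
            removed.append(field["name"])
        else:
            remaining.append(field)
    blueprint[other] = remaining
    return removed


# Hypothetical blueprint: title and price were detected in both sections.
blueprint = {
    "detail_link_field": "link",
    "fields": [{"name": "link"}, {"name": "title"}, {"name": "price"}],
    "detail_fields": [{"name": "title"}, {"name": "description"}, {"name": "price"}],
}

removed = drop_duplicate_fields(blueprint, keep="detail_fields")
```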

The main blueprint keys it fills in:

Key	Meaning
`item_selector`	CSS selector matching each row in the list
`scrape_detail`	`false` = scrape the list only; `true` = also visit each item's detail page
`fields[]`	Selectors read from each list row
`detail_link_field`	Which list field holds the detail-page URL (when `scrape_detail` is `true`)
`detail_fields[]`	Selectors read from the detail page
`pagination`	`strategy` (`none` / `link_list` / `next_link` / `infinite_scroll`) + `css`
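To show how these keys fit together, here is an illustrative skeleton written as a Python dict and dumped to JSON. Only the key names come from the table above. Every selector and value is invented, and the shape of a field object (including the meaning of `attribute`) is an assumption based on the field example in the XPath selectors section, so treat the Blueprint format reference as authoritative.

```python
import json

# Every value below is illustrative; only the key names come from the table.
blueprint = {
    "item_selector": "div.listing-card",
    "scrape_detail": True,
    "fields": [
        # Assumption: "attribute" names the HTML attribute to read.
        {"name": "link", "css": "a.card-link", "attribute": "href"},
        {"name": "title", "css": "h2", "attribute": None},
    ],
    # Names the list field that holds the detail-page URL.
    "detail_link_field": "link",
    "detail_fields": [
        {"name": "description", "css": "div.description", "attribute": None},
    ],
    "pagination": {"strategy": "next_link", "css": "a.next"},
}

# With scrape_detail enabled, detail_link_field should refer to a list field.
list_field_names = [field["name"] for field in blueprint["fields"]]
link_field_is_defined = blueprint["detail_link_field"] in list_field_names

print(json.dumps(blueprint, indent=2))
```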

See the full Blueprint format reference for every key.

Label-based fields

Auction pages express facts as label: value pairs whose value elements share CSS classes (so CSS alone can't tell them apart). For those, a field carries a label instead of relying on css, and the value is looked up by that label at run time. The field name is derived from the label:

1ª Praça: 22/06/2026 às 10:01 → "1_praca": "22/06/2026 às 10:01"
Comitente → comitente
Valor de Avaliação → valor_de_avaliacao
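The exact naming rule is not spelled out here, but the three examples are consistent with stripping accents, dropping non-ASCII symbols, lowercasing, and joining words with underscores. The Python sketch below reproduces those three results under that assumption; `label_to_field_name` is a hypothetical helper, not the generator's own function.

```python
import re
import unicodedata


def label_to_field_name(label):
    """Derive a snake_case field name from a human-readable label."""
    # NFD splits accented letters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", label)
    # Encoding to ASCII with errors="ignore" drops the combining marks and
    # any remaining non-ASCII symbol, such as the ordinal indicator.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Lowercase, then collapse each run of non-alphanumerics to one underscore.
    return re.sub(r"[^a-z0-9]+", "_", ascii_only.lower()).strip("_")


names = [label_to_field_name(label)
         for label in ("1ª Praça", "Comitente", "Valor de Avaliação")]
```

NFD is used rather than NFKD on purpose: NFKD would turn the ordinal indicator into a plain "a" and yield `1a_praca`, which does not match the example above.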

XPath selectors

Any field in fields[] or detail_fields[] can use XPath instead of CSS by adding "type": "xpath". The expression itself still goes in the css key:

```json
{
  "name": "price",
  "css": "//span[contains(@class,'price')][1]",
  "type": "xpath",
  "attribute": null
}
```

"type" defaults to "css", so existing blueprints are unaffected. XPath is more expressive than CSS for attribute-based conditions, text matching, and ancestor/sibling traversal.
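Two short Python illustrations of the points above. The first shows the default rule: a field without a `type` key is treated as CSS. The second shows the kind of text matching that CSS cannot express, using the limited XPath subset in Python's standard `xml.etree.ElementTree`. Both are sketches with hypothetical markup and helper names; the engine the tool uses at run time, and which XPath features it supports, are not stated here.

```python
import xml.etree.ElementTree as ET


def selector_type(field):
    """Return the selector type, falling back to 'css' when none is given."""
    return field.get("type", "css")


fields = [
    {"name": "title", "css": "h2 a", "attribute": None},  # no "type" key
    {"name": "price", "css": "//span[contains(@class,'price')][1]",
     "type": "xpath", "attribute": None},
]
by_type = {field["name"]: selector_type(field) for field in fields}

# Text matching: pick the <strong> whose sibling <span> reads "Comitente".
# ElementTree lacks contains(), so this uses the [tag='text'] predicate.
doc = ET.fromstring(
    "<ul>"
    "<li><span>Comitente</span><strong>Banco X</strong></li>"
    "<li><span>Lote</span><strong>42</strong></li>"
    "</ul>"
)
value = doc.find(".//li[span='Comitente']/strong").text
```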

Validate a blueprint

After hand-editing a blueprint, check that it still parses and is structurally sound:

```bash
php artisan datahelm:scrap:validate path/to/blueprint.json
```
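For a quick look at the kind of problems hand-editing tends to introduce, the Python sketch below runs a few structural checks. It is not the validator's rule set, which is not documented here: the `precheck` helper and the three rules it applies are assumptions drawn from the key table in What detection produces.

```python
import json


def precheck(text):
    """Return a list of problems found in a blueprint given as JSON text."""
    try:
        blueprint = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(blueprint, dict):
        return ["top level must be a JSON object"]

    problems = []
    selector = blueprint.get("item_selector")
    if not isinstance(selector, str) or not selector.strip():
        problems.append("item_selector must be a non-empty string")

    fields = blueprint.get("fields")
    if not isinstance(fields, list):
        problems.append("fields must be a list")
        fields = []

    # Assumed rule: with scrape_detail on, the link field must exist in fields.
    names = {f.get("name") for f in fields if isinstance(f, dict)}
    if blueprint.get("scrape_detail") and blueprint.get("detail_link_field") not in names:
        problems.append("detail_link_field must name a field in fields")
    return problems


ok = precheck('{"item_selector": "div.card", "fields": [{"name": "link"}], '
              '"scrape_detail": true, "detail_link_field": "link"}')
trailing_comma = precheck('{"item_selector": "div.card",}')
```

A trailing comma left behind after deleting a field is a common slip, and it is reported as a parse error before any structural rule runs.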

Next: Run a scrape →
