Generate a blueprint #
Step 1 of the two-step workflow. You point the generator at a URL; it fetches the page and auto-detects the item list, pagination, and field selectors (link, title, image, price, rating, address, description), printing a blueprint as JSON.
php artisan datahelm:scrap:generate <url> [--get-detail=true] [--max-pages=0] [--save]
Field detectors are configurable in config/crawler.php (detectors) — add your own to teach the generator domain-specific fields.
TIP
In the Docker stack, run the command through docker compose run --rm in place of php: docker compose run --rm artisan datahelm:scrap:generate <url> …
Worked example #
php artisan datahelm:scrap:generate \
https://www.exampleauctions.com/real-estate/apartments \
--get-detail=true --get-primary-image=true --hash-names=true --robot --robot-name=ExampleAuctions
The blueprint is plain data — open the JSON and fix any selector the heuristics got wrong before running it.
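Because the blueprint is plain JSON, a selector fix can also be scripted rather than typed by hand. The sketch below is only an illustration: the helper name `set_selector`, the sample selectors, and the trimmed-down blueprint are hypothetical, and the field shape (`name`, `css`) is assumed from the field example shown under XPath selectors.

```python
import json


def set_selector(blueprint, field_name, css, section="fields"):
    """Replace the css of the first field named field_name in a section.

    Returns True if a field was updated, False if no field had that name.
    """
    for field in blueprint.get(section, []):
        if field.get("name") == field_name:
            field["css"] = css
            return True
    return False


# Hypothetical blueprint fragment; a generated blueprint has more keys.
blueprint = json.loads("""
{
  "item_selector": "div.card",
  "fields": [
    {"name": "title", "css": "h2"},
    {"name": "price", "css": "span.amount"}
  ]
}
""")

# Suppose the heuristics picked the wrong price element.
changed = set_selector(blueprint, "price", "span.price-current")
print(json.dumps(blueprint, indent=2))
```

Passing `section="detail_fields"` would apply the same fix to a detail-page field.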
Options #
| Option | Effect |
|---|---|
| --get-detail=true | Also visit a sample item to detect detail-page fields (incl. image gallery) |
| --max-pages= | Page cap stored in the blueprint (default 0 = all pages) |
| --get-all-images=true | Put every image URL into all_images in the JSON (default false) |
| --get-primary-image=true | Put only the primary (most relevant) image URL per item into primary_image (default false) |
| --get-gallery-images=true | Detect the detail-page gallery and put its image URLs into gallery_images; implies --get-detail (default false) |
| --hash-names=true | Rename stored images to a unique content hash when you download them (default false) |
| --http-delay= | Milliseconds to wait between page requests (default 0) |
| --preset= | Detection heuristics profile: generic (default), ecommerce, auctions, properties. See Presets |
| --http-timeout= | Request timeout in seconds (default 60) |
| --http-retries= | Retry count on transient failures (default 3) |
| --output-format= | Output format: json, jsonl, csv, markdown (default json). See Markdown output |
| --dedup | Enable deduplication (drop items with a repeated key-field value) |
| --dedup-key= | Field used as the dedup uniqueness key (default link) |
| --page-delay= | Milliseconds to wait after each pagination page (default 0) |
| --item-delay= | Milliseconds to wait after each item (default 0) |
| --max-items= | Blueprint-level item cap (0 = unlimited) |
| --api-endpoint= | Force API mode: the JSON endpoint a JS site calls (skips HTML detection) |
| --api-method= | HTTP method for the API endpoint: GET (default) or POST |
| --api-items-path= | Dot-path to the items array in the JSON (e.g. data.results.content) |
| --search-filters= | JSON array of category pages to crawl with the same blueprint. See Multiple categories |
| --transport= | HTTP transport, also baked into the blueprint: auto, guzzle, browser, flaresolverr, scraping_api. See Transports |
| --render-js | Shortcut to set render_js=true (use a headless browser; needs the browser/flaresolverr transport) |
| --header="K: V" | Extra request header (repeatable), baked into http_config.headers. Replay captured browser headers |
| --cookie="a=1; b=2" | Cookies (e.g. captured _px3/_abck/session) baked into http_config.cookies; the free way past hard WAFs for a one-off run |
| --json | Print the raw blueprint JSON to stdout instead of saving it |
| --save | Save the blueprint under the host name for later datahelm:scrap:run <host> |
| --robot | Scaffold a Robot{Name} command under app/Console/Commands/RobotsCommand. See Robots |
| --robot-name= | Class base name for --robot (default: derived from the host) |
| --force | Overwrite an existing robot command file |
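To make the --dedup and --dedup-key behaviour concrete, here is a minimal sketch of key-based deduplication. It is not the package's implementation: the function name `dedup`, the sample items, and the choice to keep items that lack the key are all assumptions made for illustration.

```python
def dedup(items, key="link"):
    """Keep the first item for each distinct value of `key`, preserving order.

    Assumption: items that lack the key (or hold None) are kept,
    since there is nothing to compare them on.
    """
    seen = set()
    kept = []
    for item in items:
        value = item.get(key)
        if value is None:
            kept.append(item)
            continue
        if value in seen:
            continue  # repeated key-field value: drop the item
        seen.add(value)
        kept.append(item)
    return kept


# Hypothetical scraped items; the third repeats the first item's link.
items = [
    {"link": "/lot/1", "title": "Apartment A"},
    {"link": "/lot/2", "title": "Apartment B"},
    {"link": "/lot/1", "title": "Apartment A (promoted)"},
]
unique = dedup(items)               # default key, like --dedup alone
by_title = dedup(items, "title")    # like --dedup --dedup-key=title
```

With the default key the promoted duplicate is dropped; keyed on title, all three items survive because their titles differ, which is why the choice of key field matters.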
What detection produces #
The generator suggests fields generously — as many as it can find in both the list and the detail page — so you keep what you want and delete the rest. A field that appears in both sections uses the same name in each: keep it under fields or detail_fields, whichever you prefer, and remove the other.
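When a generated blueprint is large, the overlap between the two sections can be listed and pruned programmatically. The following is a rough sketch under assumptions: the helper names `shared_field_names` and `keep_in` and the sample blueprint are hypothetical, and each field entry is assumed to carry a `name` key.

```python
def shared_field_names(blueprint):
    """Return the field names present in both fields and detail_fields."""
    list_names = {f["name"] for f in blueprint.get("fields", [])}
    detail_names = {f["name"] for f in blueprint.get("detail_fields", [])}
    return sorted(list_names & detail_names)


def keep_in(blueprint, name, section):
    """Keep `name` only in `section` ("fields" or "detail_fields")."""
    other = "detail_fields" if section == "fields" else "fields"
    blueprint[other] = [f for f in blueprint.get(other, []) if f["name"] != name]


# Hypothetical blueprint fragment with "title" suggested in both sections.
blueprint = {
    "fields": [
        {"name": "link", "css": "a.card-link"},
        {"name": "title", "css": "h2"},
    ],
    "detail_fields": [
        {"name": "title", "css": "h1"},
        {"name": "description", "css": "div.description"},
    ],
}

overlap = shared_field_names(blueprint)      # names to decide on
keep_in(blueprint, "title", "detail_fields")  # prefer the detail-page title
```

If scrape_detail is true, the list field named by detail_link_field should stay under fields, since the detail-page visit depends on it.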
The main blueprint keys it fills in:
| Key | Meaning |
|---|---|
| item_selector | CSS selector matching each row in the list |
| scrape_detail | false = scrape the list only; true = also visit each item's detail page |
| fields[] | Selectors read from each list row |
| detail_link_field | Which list field holds the detail-page URL (when scrape_detail is true) |
| detail_fields[] | Selectors read from the detail page |
| pagination | strategy (none / link_list / next_link / infinite_scroll) + css |
See the full Blueprint format reference for every key.
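How these keys relate to each other can be shown with a small consistency check. This is a sketch only, separate from the datahelm:scrap:validate command, whose actual checks are not described here; the function name `check_blueprint`, the selector values, and the minimal field shape are assumptions.

```python
PAGINATION_STRATEGIES = {"none", "link_list", "next_link", "infinite_scroll"}


def check_blueprint(bp):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if not bp.get("item_selector"):
        problems.append("item_selector is empty")
    strategy = bp.get("pagination", {}).get("strategy")
    if strategy not in PAGINATION_STRATEGIES:
        problems.append(f"unknown pagination strategy: {strategy!r}")
    if bp.get("scrape_detail"):
        # detail_link_field must point at one of the list fields.
        names = {f.get("name") for f in bp.get("fields", [])}
        if bp.get("detail_link_field") not in names:
            problems.append("detail_link_field does not name a list field")
    return problems


# Hypothetical blueprint using only the keys from the table above.
blueprint = {
    "item_selector": "div.card",
    "scrape_detail": True,
    "fields": [
        {"name": "link", "css": "a.card-link"},
        {"name": "title", "css": "h2"},
    ],
    "detail_link_field": "link",
    "detail_fields": [{"name": "description", "css": "div.description"}],
    "pagination": {"strategy": "next_link", "css": "a.next"},
}

print(check_blueprint(blueprint))
```

A check like this is mostly useful after hand-editing, for example when a list field was renamed or removed but detail_link_field still refers to the old name.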
Label-based fields #
Auction pages express facts as label: value pairs whose value elements share CSS classes (so CSS alone can't tell them apart). For those, a field carries a label instead of relying on css, and the value is looked up by that label at run time. The field name is derived from the label:
- 1ª Praça: 22/06/2026 às 10:01 → "1_praca": "22/06/2026 às 10:01"
- Comitente → comitente
- Valor de Avaliação → valor_de_avaliacao
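The label-to-name mapping above can be approximated with a short slug function. The exact rules the generator applies are not documented here, so treat this as an assumption-laden sketch that merely reproduces the three examples; the names `field_name_from_label` and `split_label_value` are hypothetical.

```python
import re
import unicodedata


def field_name_from_label(label):
    """Derive a snake_case ASCII field name from a human-readable label."""
    # NFD splits accented letters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", label)
    # Dropping non-ASCII removes the marks and symbols such as the ordinal "ª".
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of other characters into a single underscore.
    return re.sub(r"[^a-z0-9]+", "_", ascii_only.lower()).strip("_")


def split_label_value(text):
    """Split 'label: value' on the first colon only, so times like 10:01 survive."""
    label, _, value = text.partition(":")
    return label.strip(), value.strip()


label, value = split_label_value("1ª Praça: 22/06/2026 às 10:01")
record = {field_name_from_label(label): value}
print(record)
```

Splitting on the first colon only matters for values that contain colons themselves, such as the time in the first example.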
XPath selectors #
Any field in fields[] or detail_fields[] can use XPath instead of CSS by adding "type": "xpath":
{
"name": "price",
"css": "//span[contains(@class,'price')][1]",
"type": "xpath",
"attribute": null
}
"type" defaults to "css", so existing blueprints are unaffected. XPath is more expressive than CSS for attribute-based conditions, text matching, and ancestor/sibling traversal.
Validate a blueprint #
After hand-editing, check it parses and is structurally sound:
php artisan datahelm:scrap:validate path/to/blueprint.json

