Generate a blueprint

Step 1 of the two-step workflow. You point the generator at a URL; it fetches the page and auto-detects the item list, pagination, and field selectors (link, title, image, price, rating, address, description), printing a blueprint as JSON.

bash
php artisan datahelm:scrap:generate <url> [--get-detail=true] [--max-pages=0] [--save]

Field detectors are configurable in config/crawler.php (detectors) — add your own to teach the generator domain-specific fields.

TIP

In the Docker stack, run the command through docker compose run --rm instead of php: docker compose run --rm artisan datahelm:scrap:generate <url> …

Worked example

bash
php artisan datahelm:scrap:generate \
  https://www.exampleauctions.com/real-estate/apartments \
  --get-detail=true --get-primary-image=true --hash-names=true --robot --robot-name=ExampleAuctions

The blueprint is plain data — open the JSON and fix any selector the heuristics got wrong before running it.

Options

| Option | Effect |
| --- | --- |
| --get-detail=true | Also visit a sample item to detect detail-page fields (incl. image gallery) |
| --max-pages= | Page cap stored in the blueprint (default 0 = all pages) |
| --get-all-images=true | Put every image URL into all_images in the JSON (default false) |
| --get-primary-image=true | Put only the primary (most relevant) image URL per item into primary_image (default false) |
| --get-gallery-images=true | Detect the detail-page gallery and put its image URLs into gallery_images; implies --get-detail (default false) |
| --hash-names=true | Rename stored images to a unique content hash when you download them (default false) |
| --http-delay= | Milliseconds to wait between page requests (default 0) |
| --preset= | Detection heuristics profile: generic (default), ecommerce, auctions, properties. See Presets |
| --http-timeout= | Request timeout in seconds (default 60) |
| --http-retries= | Retry count on transient failures (default 3) |
| --output-format= | Output format: json, jsonl, csv, markdown (default json). See Markdown output |
| --dedup | Enable deduplication (drop items with a repeated key-field value) |
| --dedup-key= | Field used as the dedup uniqueness key (default link) |
| --page-delay= | Milliseconds to wait after each pagination page (default 0) |
| --item-delay= | Milliseconds to wait after each item (default 0) |
| --max-items= | Blueprint-level item cap (0 = unlimited) |
| --api-endpoint= | Force API mode: the JSON endpoint a JS site calls (skips HTML detection) |
| --api-method= | HTTP method for the API endpoint: GET (default) or POST |
| --api-items-path= | Dot-path to the items array in the JSON (e.g. data.results.content) |
| --search-filters= | JSON array of category pages to crawl with the same blueprint. See Multiple categories |
| --transport= | HTTP transport, also baked into the blueprint: auto, guzzle, browser, flaresolverr, scraping_api. See Transports |
| --render-js | Shortcut to set render_js=true (use a headless browser; needs the browser/flaresolverr transport) |
| --header="K: V" | Extra request header (repeatable), baked into http_config.headers. Replay captured browser headers |
| --cookie="a=1; b=2" | Cookies (e.g. captured _px3/_abck/session) baked into http_config.cookies: the free way past hard WAFs for a one-off run |
| --json | Print the raw blueprint JSON to stdout instead of saving it |
| --save | Save the blueprint under the host name for later datahelm:scrap:run <host> |
| --robot | Scaffold a Robot{Name} command under app/Console/Commands/RobotsCommand. See Robots |
| --robot-name= | Class base name for --robot (default: derived from the host) |
| --force | Overwrite an existing robot command file |
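
Two of the options above describe small data transformations: --api-items-path walks a dot-path into the API response, and --dedup drops items whose key field repeats. The following Python sketch illustrates both ideas on a made-up response. It is not the tool's implementation: the function names and the sample payload are hypothetical, and keeping the first occurrence of a duplicate is an assumption, since the option description does not say which copy survives.

```python
from typing import Any


def resolve_items_path(payload: dict, dot_path: str) -> list:
    """Walk a dot-separated path such as 'data.results.content' into nested dicts."""
    node: Any = payload
    for key in dot_path.split("."):
        node = node[key]  # raises KeyError if a path segment is missing
    return node


def dedup(items: list[dict], key: str = "link") -> list[dict]:
    """Drop items whose `key` value was already seen (assumption: first one wins)."""
    seen = set()
    unique = []
    for item in items:
        value = item.get(key)  # items lacking the key all share the value None
        if value in seen:
            continue
        seen.add(value)
        unique.append(item)
    return unique


# Hypothetical API response, shaped to match the data.results.content example.
response = {"data": {"results": {"content": [
    {"link": "/lot/1", "title": "Apartment A"},
    {"link": "/lot/2", "title": "Apartment B"},
    {"link": "/lot/1", "title": "Apartment A (repeat)"},
]}}}

items = resolve_items_path(response, "data.results.content")
unique_items = dedup(items, key="link")
print([item["link"] for item in unique_items])
```

Choosing a different --dedup-key corresponds to passing another field name as key in this sketch.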

What detection produces

The generator suggests fields generously — as many as it can find in both the list and the detail page — so you keep what you want and delete the rest. A field that appears in both sections uses the same name in each: keep it under fields or detail_fields, whichever you prefer, and remove the other.

The key blueprint keys it fills in:

| Key | Meaning |
| --- | --- |
| item_selector | CSS selector matching each row in the list |
| scrape_detail | false = scrape the list only; true = also visit each item's detail page |
| fields[] | Selectors read from each list row |
| detail_link_field | Which list field holds the detail-page URL (when scrape_detail is true) |
| detail_fields[] | Selectors read from the detail page |
| pagination | strategy (none / link_list / next_link / infinite_scroll) + css |

See the full Blueprint format reference for every key.
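
To make the relationships between these keys concrete, here is a minimal Python sketch that loads a blueprint and flags two things worth reviewing by hand: a detail_link_field that does not name a list field, and a field name left in both sections. The blueprint content (selectors, field names) is invented for illustration, the field shape is borrowed from the XPath example on this page, and check_blueprint is a hypothetical helper, not the logic of datahelm:scrap:validate.

```python
import json


def check_blueprint(bp: dict) -> list[str]:
    """Return human-readable notes about a blueprint; an empty list means nothing to review."""
    notes = []
    if not bp.get("item_selector"):
        notes.append("item_selector is empty")

    list_names = {f["name"] for f in bp.get("fields", [])}
    detail_names = {f["name"] for f in bp.get("detail_fields", [])}

    # When detail pages are scraped, the link must come from a list field.
    if bp.get("scrape_detail"):
        link_field = bp.get("detail_link_field")
        if link_field not in list_names:
            notes.append(f"detail_link_field {link_field!r} is not a list field")

    # A field detected in both sections shares one name; keep only one copy.
    for name in sorted(list_names & detail_names):
        notes.append(f"field {name!r} appears in both fields and detail_fields")
    return notes


# Hypothetical blueprint; all selectors are made up.
blueprint = json.loads("""
{
  "item_selector": "div.listing-card",
  "scrape_detail": true,
  "fields": [
    {"name": "link", "css": "a.card-link", "attribute": "href"},
    {"name": "title", "css": "h2.card-title", "attribute": null},
    {"name": "price", "css": "span.price", "attribute": null}
  ],
  "detail_link_field": "link",
  "detail_fields": [
    {"name": "price", "css": "div.detail span.price", "attribute": null},
    {"name": "description", "css": "div.description", "attribute": null}
  ],
  "pagination": {"strategy": "next_link", "css": "a.next"}
}
""")

notes = check_blueprint(blueprint)
print(notes)
```

Removing the duplicated price entry from either section leaves the sketch with nothing to report.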

Label-based fields

Auction pages often present facts as label: value pairs whose value elements all share the same CSS classes, so a CSS selector alone can't tell them apart. For those, a field carries a label instead of relying on css, and the value is looked up by that label at run time. The field name is derived from the label:

  • 1ª Praça: 22/06/2026 às 10:01 → "1_praca": "22/06/2026 às 10:01"
  • Comitente → comitente
  • Valor de Avaliação → valor_de_avaliacao
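
The exact naming rule is not spelled out here, but the three examples are consistent with a simple slug: drop accents and other non-ASCII characters, lowercase, and join the remaining words with underscores. The Python sketch below reproduces the examples under that assumption; field_name_from_label and split_pair are hypothetical names, and the generator's real rule may differ on labels not shown above.

```python
import re
import unicodedata


def field_name_from_label(label: str) -> str:
    """Turn a page label into a snake_case field name (assumed rule, see lead-in)."""
    # NFD splits accented letters into base letter + combining mark,
    # so the ASCII filter keeps "c" from "ç" and drops symbols such as "ª".
    decomposed = unicodedata.normalize("NFD", label)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "_", ascii_only.lower()).strip("_")


def split_pair(text: str) -> tuple[str, str]:
    """Split 'label: value' at the first colon only, so times like 10:01 stay intact."""
    label, _, value = text.partition(":")
    return label.strip(), value.strip()


label, value = split_pair("1ª Praça: 22/06/2026 às 10:01")
record = {field_name_from_label(label): value}
print(record)
print(field_name_from_label("Comitente"))
print(field_name_from_label("Valor de Avaliação"))
```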

XPath selectors

Any field in fields[] or detail_fields[] can use XPath instead of CSS by adding "type": "xpath". The expression still goes in the css key:

json
{
  "name": "price",
  "css": "//span[contains(@class,'price')][1]",
  "type": "xpath",
  "attribute": null
}

"type" defaults to "css", so existing blueprints are unaffected. XPath is more expressive than CSS for attribute-based conditions, text matching, and ancestor/sibling traversal.

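As a small illustration of the text-matching case, the Python sketch below selects a value by the text of its sibling label, which a CSS selector cannot express. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no contains(), for instance), so it demonstrates the idea rather than the engine the scraper uses. The markup and values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet: both value spans share the same class,
# so only the label text distinguishes them.
snippet = """
<ul>
  <li><label>Comitente</label><span class="value">Banco Exemplo</span></li>
  <li><label>Valor de Avaliação</label><span class="value">R$ 250.000,00</span></li>
</ul>
""".strip()

root = ET.fromstring(snippet)

# li[label='Comitente'] keeps only the <li> whose <label> child has that exact text;
# /span then steps to the value element inside it.
seller = root.find(".//li[label='Comitente']/span")
appraisal = root.find(".//li[label='Valor de Avaliação']/span")

print(seller.text)
print(appraisal.text)
```
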
Validate a blueprint

After hand-editing a blueprint, check that it still parses and is structurally sound:

bash
php artisan datahelm:scrap:validate path/to/blueprint.json

Next: Run a scrape →

Released under the MIT License.