Core concepts
A handful of ideas underpin everything else. Once these click, the rest of the docs are just detail.
Blueprint
A blueprint is plain, editable JSON that fully describes how to scrape one site: the item list selector, pagination strategy, field selectors, transport, image options, filters, output format and more. Auto-detection produces a first draft; you refine it by hand.
It is just data — there is no hidden state. The full schema is in the Blueprint format reference.
{
  "url": "https://books.toscrape.com/",
  "item_selector": "article.product_pod",
  "scrape_detail": false,
  "fields": [
    { "name": "title", "css": "h3 a", "attribute": "title" },
    { "name": "price", "css": ".price_color" },
    { "name": "link", "css": "h3 a", "attribute": "href" }
  ],
  "pagination": { "strategy": "next_link", "css": "li.next a" }
}
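Because a blueprint is plain data, any JSON tooling can edit it. The sketch below uses Python purely for illustration and is not part of the package; `drop_fields` is a hypothetical helper that removes unwanted fields from the blueprint shown above.

```python
import json

BLUEPRINT_JSON = """
{
  "url": "https://books.toscrape.com/",
  "item_selector": "article.product_pod",
  "scrape_detail": false,
  "fields": [
    { "name": "title", "css": "h3 a", "attribute": "title" },
    { "name": "price", "css": ".price_color" },
    { "name": "link", "css": "h3 a", "attribute": "href" }
  ],
  "pagination": { "strategy": "next_link", "css": "li.next a" }
}
"""


def drop_fields(blueprint: dict, unwanted: set) -> dict:
    """Return a copy of the blueprint without the named fields (hypothetical helper)."""
    refined = dict(blueprint)
    refined["fields"] = [f for f in blueprint["fields"] if f["name"] not in unwanted]
    return refined


blueprint = json.loads(BLUEPRINT_JSON)
refined = drop_fields(blueprint, {"link"})

# The result is still ordinary JSON that can be saved back to disk.
print(json.dumps(refined, indent=2))
```

Since nothing lives outside the document, a refined blueprint can be diffed, reviewed and version-controlled like any other file.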
Generate → Run
The crawler works in two steps, each with its own Artisan command:
| Step | Command | What it does |
|---|---|---|
| Generate | datahelm:scrap:generate <url> | Fetches the URL, auto-detects structure, emits a blueprint |
| Run | datahelm:scrap:run <blueprint> | Loads a blueprint, crawls, extracts and exports items |
Generation is something you do once (and then tweak). Running is what you repeat to collect data.
Fields and detail pages
A field is one selector read from each item. By default fields are read from the list row (fields[]). When scrape_detail is true, the crawler also visits each item's detail page and reads detail fields (detail_fields[]) from it, following the URL in the detail_link_field.
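The list-then-detail flow can be sketched as follows. This is an illustrative Python model, not the crawler's code: `scrape_item` and `fake_read_detail` are hypothetical names, and the sketch assumes that `detail_link_field` holds the name of the list field containing the detail URL.

```python
from urllib.parse import urljoin

visited = []  # records which detail URLs the sketch "fetched"


def fake_read_detail(url, detail_fields):
    """Stand-in for fetching a detail page and extracting detail_fields from it."""
    visited.append(url)
    canned = {"description": "A sample description."}
    return {f["name"]: canned.get(f["name"]) for f in detail_fields}


def scrape_item(list_item, blueprint, read_detail):
    """Merge detail-page fields into a list-row item when scrape_detail is on."""
    if not blueprint.get("scrape_detail"):
        return list_item
    link = list_item.get(blueprint["detail_link_field"])
    if not link:
        return list_item  # nothing to follow for this row
    detail_url = urljoin(blueprint["url"], link)  # relative links resolve against the site URL
    return {**list_item, **read_detail(detail_url, blueprint["detail_fields"])}


blueprint = {
    "url": "https://books.toscrape.com/",
    "scrape_detail": True,
    "detail_link_field": "link",  # assumed shape: the name of a list field
    "detail_fields": [{"name": "description", "css": "#product_description + p"}],
}
row = {"title": "Example Book", "link": "catalogue/example-book/index.html"}

item = scrape_item(row, blueprint, fake_read_detail)
list_only = scrape_item(row, {**blueprint, "scrape_detail": False}, fake_read_detail)
```

With `scrape_detail` off, the row is returned unchanged and no detail request is made.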
Each field has a name, a selector (css, xpath or json), and optionally an attribute, a regex, a multiple flag and a label. Auto-detection deliberately suggests more fields than you are likely to need: keep what you want and delete the rest.
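To make the field options concrete, here is a minimal Python sketch of how a field definition might be applied to one list row. It is an assumption-laden model rather than the crawler's implementation: it uses xpath selectors because Python's standard library supports a subset of XPath, it assumes a regex returns its first capture group, and the sample row and `read_field` helper are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed list row standing in for one matched item.
ROW = ET.fromstring(
    '<article class="product_pod">'
    '<h3><a href="catalogue/example-book/index.html" title="Example Book">Example ...</a></h3>'
    '<p class="price_color">£12.50</p>'
    '<p class="tag">fiction</p><p class="tag">classic</p>'
    "</article>"
)


def read_field(row, field):
    """Apply one field definition to a row (hypothetical helper)."""
    values = []
    for node in row.findall(field["xpath"]):
        if field.get("attribute"):
            raw = node.get(field["attribute"], "")  # read an attribute value
        else:
            raw = "".join(node.itertext()).strip()  # read the text content
        if field.get("regex"):
            match = re.search(field["regex"], raw)
            if match is None:
                continue
            # Assumption: prefer the first capture group, else the whole match.
            raw = match.group(1) if match.groups() else match.group(0)
        values.append(raw)
    if field.get("multiple"):
        return values  # every match, as a list
    return values[0] if values else None  # first match only


FIELDS = [
    {"name": "title", "xpath": ".//h3/a", "attribute": "title"},
    {"name": "price", "xpath": ".//p[@class='price_color']", "regex": r"([\d.]+)"},
    {"name": "tags", "xpath": ".//p[@class='tag']", "multiple": True},
]

item = {f["name"]: read_field(ROW, f) for f in FIELDS}
print(item)
```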
Pagination strategies
How the crawler moves past page one:
| Strategy | Use for |
|---|---|
| none | Single page, no pagination |
| link_list | A row of numbered page links |
| next_link | A "Next ›" link |
| infinite_scroll | "Load more" buttons that fetch HTML fragments (see Infinite scroll) |
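As an illustration of the next_link strategy, the Python sketch below follows a "next" link until none is left. It models the idea only: the in-memory `PAGES` site, the `example.test` URLs and `iter_pages` are invented, and the `seen` set and `max_pages` cap are defensive choices of this sketch rather than documented crawler behaviour.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# A tiny in-memory "site" so the sketch runs without network access.
PAGES = {
    "https://example.test/page-1.html":
        '<html><body><ul><li class="next"><a href="page-2.html">next</a></li></ul></body></html>',
    "https://example.test/page-2.html":
        '<html><body><ul><li class="next"><a href="page-3.html">next</a></li></ul></body></html>',
    "https://example.test/page-3.html":
        "<html><body><p>last page</p></body></html>",
}


def iter_pages(start_url, fetch, max_pages=50):
    """Yield page URLs by following the next link until it disappears."""
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guards against pagination loops
        doc = ET.fromstring(fetch(url))
        yield url
        link = doc.find(".//li[@class='next']/a")  # XPath analogue of "li.next a"
        url = urljoin(url, link.get("href")) if link is not None else None


visited = list(iter_pages("https://example.test/page-1.html", PAGES.__getitem__))
print(visited)
```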
Crawl modes
Most sites are scraped as HTML. Two alternative modes handle modern sites:
- API mode ("mode": "api"): for JavaScript SPAs backed by a JSON endpoint. Fields are read by dot-path instead of CSS. See JavaScript sites & JSON APIs.
- Infinite scroll: an HTML pagination strategy for "Load more" listings that return HTML fragments. See Infinite scroll.
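Reading a field by dot-path can be pictured with the short Python sketch below. The exact path syntax the crawler accepts is not specified here, so this assumes dot-separated keys with numeric segments indexing into lists; `get_path` and the sample payload are hypothetical.

```python
def get_path(data, path, default=None):
    """Walk a nested JSON structure using a dot-separated path (assumed syntax)."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            current = current[int(key)]  # numeric segments index into lists
        else:
            return default  # path does not exist in this payload
    return current


payload = {
    "data": {
        "products": [
            {"name": "Example Book", "price": {"amount": 12.5, "currency": "GBP"}},
        ]
    }
}

name = get_path(payload, "data.products.0.name")
amount = get_path(payload, "data.products.0.price.amount")
missing = get_path(payload, "data.products.3.name")
```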
Transport
A transport is how a page is fetched. The crawler abstracts the difference between plain HTTP, headless Chrome, a Cloudflare solver and a managed scraping API behind one setting. auto climbs the ladder for you, identifying the WAF vendor and escalating only as far as needed. See HTTP transports & bot protection.
guzzle ─► browser ─► flaresolverr ─► scraping_api
fast      JS         Cloudflare      hardest WAFs (paid)
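The escalation idea behind auto can be sketched in Python as a loop over the ladder. This is a deliberately simplified model: the real transport identifies the WAF vendor, which this sketch does not attempt, and `fetch_auto`, `looks_blocked` and the fake transports are hypothetical stand-ins.

```python
LADDER = ["guzzle", "browser", "flaresolverr", "scraping_api"]


def fetch_auto(url, transports, looks_blocked):
    """Try each transport in ladder order and stop at the first unblocked response."""
    attempts = []
    for name in LADDER:
        attempts.append(name)
        response = transports[name](url)
        if not looks_blocked(response):
            return name, response, attempts
    raise RuntimeError(f"all transports were blocked for {url}")


def make_transport(status, body):
    """Build a fake transport that always returns the same response."""
    return lambda url: {"status": status, "body": body}


def looks_blocked(response):
    # Crude heuristic for the sketch: treat common challenge statuses as blocked.
    return response["status"] in (403, 429, 503)


transports = {
    "guzzle": make_transport(403, "Just a moment..."),
    "browser": make_transport(403, "Just a moment..."),
    "flaresolverr": make_transport(200, "<html>items</html>"),
    "scraping_api": make_transport(200, "<html>items</html>"),
}

used, response, attempts = fetch_auto("https://example.test/", transports, looks_blocked)
print(used, attempts)
```

Stopping at the first rung that works keeps the cheap, fast transports in use wherever they suffice and reserves the paid one for sites that need it.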
Robot
A robot is a generated, self-contained Artisan command (datahelm:robot:{name}) with the blueprint JSON embedded in the file — so it needs no external storage. Robots are where per-item logic lives: image downloading, processing, and persistence (Eloquent, queue, webhook). Scaffold one with the --robot flag. See Scaffold a robot.
Sinks and output
Extracted items flow to a sink. The CLI prints/saves JSON by default, but robots use a CallbackSink to run per-item code. Other sinks ship for database, queue, webhook and file targets. Output can be JSON, JSONL, CSV or LLM-ready Markdown, optionally streamed to disk as items arrive.
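The sink idea amounts to one small interface with interchangeable targets. The Python sketch below models it under stated assumptions: the package's own sinks are PHP classes whose signatures are not shown here, so the `write` method, the Python `CallbackSink` and `JsonlSink` are illustrative analogues only.

```python
import io
import json
from typing import Callable, Protocol


class Sink(Protocol):
    """Anything that can receive one extracted item at a time."""

    def write(self, item: dict) -> None: ...


class CallbackSink:
    """Runs arbitrary per-item code, as a robot would."""

    def __init__(self, callback: Callable[[dict], None]) -> None:
        self.callback = callback

    def write(self, item: dict) -> None:
        self.callback(item)


class JsonlSink:
    """Streams each item as one JSON line as soon as it arrives."""

    def __init__(self, stream) -> None:
        self.stream = stream

    def write(self, item: dict) -> None:
        self.stream.write(json.dumps(item, ensure_ascii=False) + "\n")


def export(items, sink: Sink) -> None:
    for item in items:
        sink.write(item)


items = [{"title": "Example Book"}, {"title": "Another Book"}]

collected = []
export(items, CallbackSink(collected.append))

buffer = io.StringIO()  # stands in for a file opened for writing
export(items, JsonlSink(buffer))
print(buffer.getvalue())
```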
Pipeline
Between extraction and output, each item passes through an item pipeline of processors — trimming whitespace, resolving relative URLs to absolute, coercing types to the schema. It is configurable per blueprint; see Presets & item pipeline.
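A pipeline of this kind is a list of small functions applied in order. The Python sketch below illustrates the three processors mentioned above; the function names, the schema format and the sample item are assumptions made for the example, not the package's API.

```python
from urllib.parse import urljoin


def trim_whitespace(item):
    """Strip leading and trailing whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}


def absolute_urls(base_url, keys):
    """Build a processor that resolves the named fields against base_url."""
    def process(item):
        return {
            k: urljoin(base_url, v) if k in keys and isinstance(v, str) else v
            for k, v in item.items()
        }
    return process


def coerce_types(schema):
    """Build a processor that casts fields using a {name: type} schema."""
    def process(item):
        return {
            k: schema[k](v) if k in schema and v is not None else v
            for k, v in item.items()
        }
    return process


def run_pipeline(item, processors):
    for process in processors:
        item = process(item)  # each processor returns a new item
    return item


processors = [
    trim_whitespace,
    absolute_urls("https://books.toscrape.com/", {"link"}),
    coerce_types({"price": float}),
]

raw = {"title": "  Example Book \n", "price": " 12.50", "link": "catalogue/example-book/index.html"}
clean = run_pipeline(raw, processors)
print(clean)
```

Order matters in this sketch: trimming runs first so that later processors see clean strings.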

