
Core concepts

A handful of ideas underpin everything else. Once these click, the rest of the docs are just detail.

Blueprint

A blueprint is plain, editable JSON that fully describes how to scrape one site: the item list selector, pagination strategy, field selectors, transport, image options, filters, output format and more. Auto-detection produces a first draft; you refine it by hand.

It is just data — there is no hidden state. The full schema is in the Blueprint format reference.

```json
{
  "url": "https://books.toscrape.com/",
  "item_selector": "article.product_pod",
  "scrape_detail": false,
  "fields": [
    { "name": "title", "css": "h3 a", "attribute": "title" },
    { "name": "price", "css": ".price_color" },
    { "name": "link",  "css": "h3 a", "attribute": "href" }
  ],
  "pagination": { "strategy": "next_link", "css": "li.next a" }
}
```
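Because a blueprint is plain data, any JSON tooling can read or edit it. The following is a minimal sketch in Python, meant only as an illustration and not as the package's own code. It uses the keys from the example above; the `rating` field and its selector are hypothetical additions.

```python
import json

# The blueprint from the example above, held as an ordinary JSON string.
BLUEPRINT_JSON = """
{
  "url": "https://books.toscrape.com/",
  "item_selector": "article.product_pod",
  "scrape_detail": false,
  "fields": [
    { "name": "title", "css": "h3 a", "attribute": "title" },
    { "name": "price", "css": ".price_color" },
    { "name": "link",  "css": "h3 a", "attribute": "href" }
  ],
  "pagination": { "strategy": "next_link", "css": "li.next a" }
}
"""

blueprint = json.loads(BLUEPRINT_JSON)

# Refining by hand is ordinary data editing: drop one field, add another.
blueprint["fields"] = [f for f in blueprint["fields"] if f["name"] != "link"]
# Hypothetical field, not part of the original example.
blueprint["fields"].append(
    {"name": "rating", "css": "p.star-rating", "attribute": "class"}
)

field_names = [f["name"] for f in blueprint["fields"]]
print(field_names)
print(json.dumps(blueprint["pagination"]))
```

Nothing beyond the JSON itself is needed to round-trip the document, which is the practical meaning of "no hidden state".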

Generate → Run

The crawler works in two steps, each with its own Artisan command:

| Step | Command | What it does |
| --- | --- | --- |
| Generate | datahelm:scrap:generate <url> | Fetches the URL, auto-detects structure, emits a blueprint |
| Run | datahelm:scrap:run <blueprint> | Loads a blueprint, crawls, extracts and exports items |

Generation is something you do once (and then tweak). Running is what you repeat to collect data.

Fields and detail pages

A field is one selector read from each item. By default fields are read from the list row (fields[]). When scrape_detail is true, the crawler also visits each item's detail page and reads detail fields (detail_fields[]) from it, following the URL in the detail_link_field.

Each field has a name, a css (or xpath/json) selector, and optional attribute, regex, multiple and label settings. Auto-detection errs on the side of suggesting too many fields: keep what you want and delete the rest.
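To make the field options concrete, here is a rough Python sketch of how attribute, regex and multiple might combine once a selector has matched some nodes. It is not the package's implementation: the node shape, the `read_field` helper, and the exact regex semantics (first capture group wins) are assumptions made for illustration.

```python
import re

def read_field(nodes, field):
    """Apply attribute, regex and multiple to already-matched nodes.

    Each node is modelled as {"text": str, "attrs": dict}; a real crawler
    would get nodes from a CSS or XPath engine (assumed shape).
    """
    values = []
    for node in nodes:
        # With "attribute" set, read that attribute; otherwise the text.
        if "attribute" in field:
            raw = node["attrs"].get(field["attribute"], "")
        else:
            raw = node["text"]
        raw = raw.strip()
        # Assumed semantics: keep the first capture group, else the whole match.
        if "regex" in field:
            match = re.search(field["regex"], raw)
            if match is None:
                continue
            raw = match.group(1) if match.groups() else match.group(0)
        values.append(raw)
    # "multiple" returns every match; otherwise only the first one.
    if field.get("multiple"):
        return values
    return values[0] if values else None


price_nodes = [{"text": " £51.77 ", "attrs": {}}, {"text": "£53.74", "attrs": {}}]
link_nodes = [{"text": "A Light in the Attic", "attrs": {"href": "catalogue/a.html"}}]

first_price = read_field(price_nodes, {"name": "price", "regex": r"([\d.]+)"})
all_prices = read_field(price_nodes, {"name": "price", "regex": r"([\d.]+)", "multiple": True})
link = read_field(link_nodes, {"name": "link", "attribute": "href"})
print(first_price, all_prices, link)
```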

Pagination strategies

How the crawler moves past page one:

| Strategy | Use for |
| --- | --- |
| none | Single page, no pagination |
| link_list | A row of numbered page links |
| next_link | A "Next ›" link |
| infinite_scroll | "Load more" buttons that fetch HTML fragments (see Infinite scroll) |
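As an illustration of the next_link strategy, the loop below follows a "Next" link until none is left. It is a Python sketch under stated assumptions, not the crawler's code: the in-memory `FAKE_SITE`, the `crawl_next_link` helper, and the `max_pages` and visited-URL guards are all invented for this example.

```python
from urllib.parse import urljoin

# Stand-in for the network: URL -> (items on the page, href of the "Next" link).
# The URLs and item names are made up for this sketch.
FAKE_SITE = {
    "https://example.test/catalogue/page-1.html": (["a", "b"], "page-2.html"),
    "https://example.test/catalogue/page-2.html": (["c", "d"], "page-3.html"),
    "https://example.test/catalogue/page-3.html": (["e"], None),
}


def crawl_next_link(start_url, fetch, max_pages=50):
    """Collect items page by page, following the next link each time."""
    items, seen, url = [], set(), start_url
    # Stop when there is no next link, a page repeats, or the cap is hit.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_items, next_href = fetch(url)
        items.extend(page_items)
        # Next links are often relative, so resolve against the current page.
        url = urljoin(url, next_href) if next_href else None
    return items


all_items = crawl_next_link(
    "https://example.test/catalogue/page-1.html", FAKE_SITE.__getitem__
)
print(all_items)
```

The visited-URL set guards against sites whose last page links back to itself, a common cause of endless crawls.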

Crawl modes

Most sites are scraped as HTML. Two alternatives handle modern sites:

  • API mode ("mode": "api") — for JavaScript SPAs backed by a JSON endpoint. Fields are read by dot-path instead of CSS. See JavaScript sites & JSON APIs.
  • Infinite scroll — an HTML pagination strategy for "Load more" listings that return HTML fragments. See Infinite scroll.
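Reading a field by dot-path in API mode can be pictured as walking a nested JSON structure one key at a time. The Python helper below is a sketch of that idea only; `get_path`, the sample response, and the treatment of numeric segments as list indexes are assumptions, and the package's actual path syntax may differ.

```python
def get_path(data, path, default=None):
    """Walk a nested JSON structure by a dot-separated path."""
    current = data
    for part in path.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif isinstance(current, list) and part.isdigit() and int(part) < len(current):
            # Treating numeric segments as list indexes is an assumption.
            current = current[int(part)]
        else:
            return default
    return current


# Made-up API response for illustration.
response = {"data": {"products": [{"name": "Widget", "price": {"amount": 9.5}}]}}

amount = get_path(response, "data.products.0.price.amount")
missing = get_path(response, "data.missing.name", default="n/a")
print(amount, missing)
```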

Transport

A transport is how a page is fetched. The crawler hides the difference between plain HTTP, headless Chrome, a Cloudflare solver and a managed scraping API behind one setting. The auto value climbs the ladder below for you: it identifies the WAF vendor and escalates only as far as needed. See HTTP transports & bot protection.

guzzle ─► browser ─► flaresolverr ─► scraping_api
 fast       JS         Cloudflare       hardest WAFs (paid)
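The escalation idea can be sketched as "try the cheapest transport first and move up only on failure". The Python below is a simplified illustration, not the crawler's logic: it omits WAF vendor detection entirely, and the `Blocked` exception, `fetch_with_escalation` helper and fake transports are hypothetical.

```python
# Ladder order taken from the diagram above, cheapest first.
LADDER = ["guzzle", "browser", "flaresolverr", "scraping_api"]


class Blocked(Exception):
    """Raised by a transport that could not get past bot protection."""


def fetch_with_escalation(url, transports, ladder=LADDER):
    """Try each transport in order and stop at the first that succeeds."""
    tried = []
    for name in ladder:
        tried.append(name)
        try:
            return transports[name](url), tried
        except Blocked:
            continue  # escalate to the next rung
    raise Blocked(f"all transports failed for {url}: {tried}")


def blocked(url):
    raise Blocked(url)


# Fake transports: plain HTTP is blocked, the headless browser gets through.
fake = {
    "guzzle": blocked,
    "browser": lambda url: "<html>ok</html>",
    "flaresolverr": lambda url: "<html>ok</html>",
    "scraping_api": lambda url: "<html>ok</html>",
}

html, tried = fetch_with_escalation("https://example.test/", fake)
print(tried)
```

Stopping at the first success is what keeps the paid rung at the end of the ladder out of the picture for sites that do not need it.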

Robot

A robot is a generated, self-contained Artisan command (datahelm:robot:{name}) with the blueprint JSON embedded in the file — so it needs no external storage. Robots are where per-item logic lives: image downloading, processing, and persistence (Eloquent, queue, webhook). Scaffold one with the --robot flag. See Scaffold a robot.

Sinks and output

Extracted items flow to a sink. The CLI prints or saves JSON by default, while robots use a CallbackSink to run per-item code. Other sinks ship for database, queue, webhook and file targets. Output can be JSON, JSONL, CSV or LLM-ready Markdown, optionally streamed to disk as items arrive.
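One way to picture a sink is as a function that receives each item as it arrives. The Python sketch below shows a callback-style sink next to a JSONL writer that emits one JSON object per line. The function names are hypothetical and unrelated to the package's CallbackSink class, and an in-memory buffer stands in for a file on disk.

```python
import io
import json


def callback_sink(handler):
    """A sink that hands every item to user code."""
    def write(item):
        handler(item)
    return write


def jsonl_sink(stream):
    """A sink that streams items as JSON Lines: one object per line."""
    def write(item):
        stream.write(json.dumps(item, ensure_ascii=False) + "\n")
    return write


items = [{"title": "A", "price": 1.5}, {"title": "B", "price": 2.0}]

collected = []
buffer = io.StringIO()  # stands in for a file opened for writing
sinks = [callback_sink(collected.append), jsonl_sink(buffer)]

for item in items:  # items arrive one at a time
    for sink in sinks:
        sink(item)

print(buffer.getvalue(), end="")
```

Because JSONL needs no enclosing array, each line is complete on its own, which is what makes streaming to disk as items arrive straightforward.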

Pipeline

Between extraction and output, each item passes through an item pipeline of processors — trimming whitespace, resolving relative URLs to absolute, coercing types to the schema. It is configurable per blueprint; see Presets & item pipeline.
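The three processors named above can be sketched as small functions applied in order. This is an illustrative Python version, not the package's pipeline: the processor names, the `ctx` dictionary, and the sample schema are assumptions made for the example.

```python
from urllib.parse import urljoin


def trim(item, ctx):
    """Strip surrounding whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in item.items()}


def absolutize(item, ctx):
    """Resolve relative URLs in the configured fields against the base URL."""
    out = dict(item)
    for key in ctx["url_fields"]:
        if out.get(key):
            out[key] = urljoin(ctx["base_url"], out[key])
    return out


def coerce(item, ctx):
    """Cast fields to the types named in the schema."""
    out = dict(item)
    for key, cast in ctx["schema"].items():
        if out.get(key) is not None:
            out[key] = cast(out[key])
    return out


def run_pipeline(item, processors, ctx):
    """Pass the item through each processor in turn."""
    for processor in processors:
        item = processor(item, ctx)
    return item


# Hypothetical per-blueprint settings.
ctx = {
    "base_url": "https://books.toscrape.com/",
    "url_fields": ["link"],
    "schema": {"price": float, "in_stock": int},
}
raw = {
    "title": "  A Light in the Attic ",
    "price": " 51.77 ",
    "in_stock": "22",
    "link": "catalogue/a-light-in-the-attic_1000/index.html",
}
clean = run_pipeline(raw, [trim, absolutize, coerce], ctx)
print(clean)
```

Each processor returns a new dictionary rather than mutating its input, so the raw item stays available for debugging.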



Released under the MIT License.