Run a scrape

This is step 2 of the two-step workflow. The command loads a blueprint, follows pagination, extracts each item (and its detail page when configured), and prints or saves the items (JSON by default).

```bash
php artisan datahelm:scrap:run <blueprint-path-or-host> [--limit=N] [--output=PATH]
```

The blueprint argument is either a file path, or a host name previously saved with --save during generation. --limit=N stops after N items.

Where output goes

By default the JSON is saved automatically to storage/app/scrapes/<name>.json, where <name> is derived from the command name. For example, datahelm:robot:exampleauctions writes storage/app/scrapes/exampleauctions.json with no flag needed. Progress lines (fetching: …) and the saved-path message go to stderr, so stdout stays clean for piping.

```bash
# auto-saves to storage/app/scrapes/exampleauctions.json
php artisan datahelm:robot:exampleauctions --limit=20

# choose a different path
php artisan datahelm:robot:exampleauctions --limit=20 --output=storage/app/scrapes/today.json

# print to stdout instead of saving (e.g. to pipe)
php artisan datahelm:robot:exampleauctions --limit=20 --output=- > exampleauctions.json
```
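
Because progress and the saved-path message go to stderr, a calling script can parse stdout directly. The following is a minimal Python sketch, assuming the default json format (a single JSON array on stdout). `run_and_parse` is a hypothetical helper for illustration, not part of DataHelm.

```python
import json
import subprocess


def run_and_parse(command):
    """Run a command and return (items, stderr_text).

    Hypothetical helper: stdout is parsed as one JSON array, while
    stderr (progress lines, crawl stats) is captured separately.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout), result.stderr


# Usage (requires the project's artisan CLI to be available):
# items, log = run_and_parse([
#     "php", "artisan", "datahelm:robot:exampleauctions",
#     "--limit=20", "--output=-",
# ])
```

Keeping stderr separate means the progress text can still be logged or inspected without ever reaching the JSON parser.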

--limit and --output are available on datahelm:scrap:run and on every generated robot command.

Full example

```bash
php artisan datahelm:scrap:generate https://books.toscrape.com/ --get-detail=true --save
php artisan datahelm:scrap:run books.toscrape.com --limit=20
```

Output formats

Set in the blueprint's output_config.format:

| Format | Description |
| --- | --- |
| json | Pretty-printed JSON array (default) |
| jsonl | One JSON object per line; better for large crawls, easy to stream |
| csv | Comma-separated; first row is headers; array values are JSON-encoded in their cell |
| markdown | One Markdown section per item, for LLM ingestion / RAG. See Markdown output |
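
In the csv format, array values arrive as JSON text inside a cell, so a consumer has to decode them. The following is a minimal Python sketch, assuming that a cell starting with `[` holds a JSON-encoded array. The column names `title` and `saved_images` and the helper `read_scrape_csv` are illustrative only.

```python
import csv
import io
import json


def read_scrape_csv(text):
    """Parse CSV text and decode cells that hold JSON-encoded arrays."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        decoded = {}
        for key, value in row.items():
            # Assumption: only array values are JSON-encoded, so a cell
            # starting with "[" is tried as JSON; anything else stays text.
            if isinstance(value, str) and value.startswith("["):
                try:
                    decoded[key] = json.loads(value)
                except json.JSONDecodeError:
                    decoded[key] = value  # plain text that merely starts with "["
            else:
                decoded[key] = value
        rows.append(decoded)
    return rows


sample = 'title,saved_images\nLot 1,"[""a.jpg"", ""b.jpg""]"\n'
print(read_scrape_csv(sample))
# [{'title': 'Lot 1', 'saved_images': ['a.jpg', 'b.jpg']}]
```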

flatten: true collapses nested arrays: saved_images[0] → saved_images_0. You can also set exclude_fields and rename_fields. See the Blueprint reference.
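
The flatten naming can be pictured as joining the field name and the array index with an underscore. The following is a minimal Python sketch of that naming only. `flatten_item` is a hypothetical helper, and leaving deeper nesting untouched is an assumption, since this page only shows the single-level array case.

```python
def flatten_item(item):
    """Collapse list values into indexed keys: saved_images[0] -> saved_images_0."""
    flat = {}
    for key, value in item.items():
        if isinstance(value, list):
            for index, element in enumerate(value):
                # Assumption: elements are copied as-is, without recursing further.
                flat[f"{key}_{index}"] = element
        else:
            flat[key] = value
    return flat


item = {"title": "Lot 1", "saved_images": ["a.jpg", "b.jpg"]}
print(flatten_item(item))
# {'title': 'Lot 1', 'saved_images_0': 'a.jpg', 'saved_images_1': 'b.jpg'}
```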

Streaming output

Write items to disk as they are scraped instead of buffering everything in memory — useful for large crawls (thousands of items):

```json
"output_config": {
  "format": "jsonl",
  "stream": true
}
```

Works with all formats. When streaming is active, the robot writes each item immediately after it is scraped; the output file grows in real time and you can tail it while the crawl runs.
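
Since the file grows while the crawl runs, a consumer can pick up new items incrementally. The following is a minimal Python sketch for the jsonl format, assuming each item is written as one newline-terminated line. `read_new_items` is a hypothetical helper that skips a final line that is still being written.

```python
import json


def read_new_items(path, offset=0):
    """Return (items, new_offset) for complete lines appended since offset."""
    items = []
    with open(path, "rb") as handle:
        handle.seek(offset)
        for raw in handle:
            if not raw.endswith(b"\n"):
                break  # last line is incomplete; pick it up on the next call
            offset += len(raw)
            if raw.strip():
                items.append(json.loads(raw))
    return items, offset


# Usage: call repeatedly (for example once per second), passing the
# returned offset back in so each item is read exactly once.
```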

Crawl stats

After every crawl a summary is printed to stderr automatically — no configuration needed:

```text
--- Crawl stats ---
  Items scraped : 24 in 8s (3.0/s)
  Pages         : 3 fetched, 0 failed
  Detail pages  : 24 fetched, 0 failed
  Images        : 48 saved, 0 failed
  Cache         : 0 hits, 27 misses
```

The counts stay accurate for runs capped with --limit and for runs in streaming mode.
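
If a wrapper script needs these numbers, the summary can be pulled out of the captured stderr text. The following is a minimal Python sketch that assumes the `label : value` layout shown in the sample above. `parse_crawl_stats` is a hypothetical helper, and the layout is not a documented, stable interface.

```python
import re

# One "label : value" line, e.g. "  Pages         : 3 fetched, 0 failed"
STAT_LINE = re.compile(r"^\s*(?P<label>[A-Za-z ]+?)\s*:\s*(?P<value>.+)$")


def parse_crawl_stats(stderr_text):
    """Return the lines that follow '--- Crawl stats ---' as a dict of strings."""
    stats = {}
    in_block = False
    for line in stderr_text.splitlines():
        if line.strip() == "--- Crawl stats ---":
            in_block = True
            continue
        if not in_block:
            continue  # progress lines before the summary are ignored
        if not line.strip():
            if stats:
                break  # a blank line after the stats ends the block
            continue
        match = STAT_LINE.match(line)
        if not match:
            break
        stats[match.group("label")] = match.group("value").strip()
    return stats
```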

Throttling and caps

These keep crawls polite and bounded (all set in the blueprint):

  • crawl_config.delay_between_pages_ms / delay_between_items_ms — fixed delays.
  • crawl_config.max_items — blueprint-level hard cap (0 = no cap). The per-run --limit flag takes precedence when non-zero.
  • http_config.delay_ms — delay between page requests; for auction-style sites, 300–500 ms avoids rate limiting.
  • auto_throttle — dynamically adjusts the inter-page delay based on actual server latency (Scrapy's AutoThrottle). See the Blueprint reference.
  • cache — caches raw page HTML to disk so re-runs replay from cache instead of hitting the live site, ideal while iterating on selectors. See the Blueprint reference.
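
Two of the rules above can be sketched in a few lines of Python. `effective_limit` mirrors the stated precedence between max_items and --limit. `next_delay_ms` follows the update rule described in Scrapy's AutoThrottle documentation (average the current delay with latency divided by target concurrency, clamp it, and never lower the delay after a non-200 response). Both functions, their parameter names, and the default bounds are hypothetical; whether DataHelm's auto_throttle matches this rule in detail is not stated on this page.

```python
def effective_limit(max_items, cli_limit):
    """Per-run --limit wins when non-zero; otherwise max_items applies (0 = no cap)."""
    cap = cli_limit if cli_limit else max_items
    return cap or None  # None means "no cap"


def next_delay_ms(current_ms, latency_ms, ok=True,
                  min_ms=0.0, max_ms=60_000.0, target_concurrency=1.0):
    """One AutoThrottle-style delay update (illustrative defaults)."""
    target = latency_ms / target_concurrency
    candidate = (current_ms + target) / 2.0          # average of old delay and target
    candidate = min(max(candidate, min_ms), max_ms)  # clamp to the allowed range
    if not ok and candidate < current_ms:
        return current_ms  # error responses must not speed the crawl up
    return candidate


print(effective_limit(max_items=100, cli_limit=20))  # 20
print(effective_limit(max_items=100, cli_limit=0))   # 100
print(next_delay_ms(500, 1500))                      # 1000.0
print(next_delay_ms(500, 100, ok=False))             # 500
```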

Tail the application log

In the Docker stack, use the container name from docker ps (php_datahelm), not the image name:

```bash
docker exec -it php_datahelm tail -f /var/www/html/storage/logs/laravel.log
```


Released under the MIT License.