Run a scrape
Step 2 of the two-step workflow. Loads a blueprint, follows pagination, extracts each item (and its detail page when configured), and prints/saves the items as JSON.
php artisan datahelm:scrap:run <blueprint-path-or-host> [--limit=N] [--output=PATH]
The blueprint argument is either a file path, or a host name previously saved with --save during generation. --limit=N stops after N items.
Where output goes
By default the JSON is saved automatically to storage/app/scrapes/<name>.json, where <name> is derived from the command — e.g. datahelm:robot:exampleauctions writes storage/app/scrapes/exampleauctions.json (no flag needed). Progress (fetching: …) and the saved-path message go to stderr, so stdout stays clean for piping.
# auto-saves to storage/app/scrapes/exampleauctions.json
php artisan datahelm:robot:exampleauctions --limit=20
# choose a different path
php artisan datahelm:robot:exampleauctions --limit=20 --output=storage/app/scrapes/today.json
# print to stdout instead of saving (e.g. to pipe)
php artisan datahelm:robot:exampleauctions --limit=20 --output=- > exampleauctions.json
--limit and --output are available on datahelm:scrap:run and on every generated robot command.
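Because progress goes to stderr and the data goes to stdout, the two streams can be redirected independently. The sketch below shows this with a hypothetical stand-in function, `fake_robot`, in place of a real robot command so that it runs anywhere; the item it prints is made up for illustration.

```shell
#!/bin/sh
# fake_robot is a hypothetical stand-in for a robot run with --output=-:
# progress lines go to stderr, the JSON payload goes to stdout.
fake_robot() {
  echo "fetching: page 1" >&2
  echo '[{"title": "Example item"}]'
}

# Capture only the data; progress is discarded.
data=$(fake_robot 2>/dev/null)

# Capture only the progress; the data is discarded.
# Order matters: stderr is pointed at the capture first, then stdout is dropped.
progress=$(fake_robot 2>&1 >/dev/null)

echo "data:     $data"
echo "progress: $progress"
```

With a real robot, the same redirections apply to a command such as `php artisan datahelm:robot:exampleauctions --limit=20 --output=-`.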
Full example
php artisan datahelm:scrap:generate https://books.toscrape.com/ --get-detail=true --save
php artisan datahelm:scrap:run books.toscrape.com --limit=20
Output formats
Set in the blueprint's output_config.format:
| Format | Description |
|---|---|
| json | Pretty-printed JSON array (default) |
| jsonl | One JSON object per line; better for large crawls, easy to stream |
| csv | Comma-separated; first row is headers; array values are JSON-encoded in their cell |
| markdown | One Markdown section per item, for LLM ingestion / RAG. See Markdown output |
flatten: true collapses nested arrays: saved_images[0] → saved_images_0. You can also exclude_fields and rename_fields. See the Blueprint reference.
Streaming output
Write items to disk as they are scraped instead of buffering everything in memory — useful for large crawls (thousands of items):
"output_config": {
  "format": "jsonl",
  "stream": true
}
Streaming works with all formats. When it is active, the robot writes each item immediately after it is scraped, so the output file grows in real time and you can tail it while the crawl runs.
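With the jsonl format, each finished item is one complete line, so line-oriented tools can inspect a crawl in progress. A minimal sketch, assuming a hypothetical output file with made-up items (the sample file is created here so the snippet is self-contained):

```shell
#!/bin/sh
# Hypothetical sample standing in for a JSONL file that a streaming crawl is writing.
out=$(mktemp)
printf '%s\n' \
  '{"title": "Item A"}' \
  '{"title": "Item B"}' \
  '{"title": "Item C"}' > "$out"

# One line per item, so counting lines counts the items written so far.
count=$(wc -l < "$out" | tr -d ' ')
echo "items so far: $count"

# Show the most recently written item.
last=$(tail -n 1 "$out")
echo "latest: $last"

rm -f "$out"
```

On a live crawl, `tail -f` on the real output path follows new items as they arrive.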
Crawl stats
After every crawl a summary is printed to stderr automatically — no configuration needed:
--- Crawl stats ---
Items scraped : 24 in 8s (3.0/s)
Pages : 3 fetched, 0 failed
Detail pages : 24 fetched, 0 failed
Images : 48 saved, 0 failed
Cache : 0 hits, 27 misses
The stats account for both the --limit flag and streaming mode, so the counts match what the run actually scraped.
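In the sample above, the rate appears to be items divided by elapsed seconds (24 / 8 = 3.0), and the cache misses equal the list-page and detail-page fetches combined (3 + 24 = 27). The sketch below recomputes both figures from a copy of that sample with `awk`; it assumes the stats layout is exactly as shown, which may differ between versions.

```shell
#!/bin/sh
# Sample stats copied from the example above (not live output).
stats='Items scraped : 24 in 8s (3.0/s)
Pages : 3 fetched, 0 failed
Detail pages : 24 fetched, 0 failed
Cache : 0 hits, 27 misses'

# Recompute the rate: field 4 is the item count, field 6 is the duration ("8s").
rate=$(printf '%s\n' "$stats" |
  awk '/^Items scraped/ { sub(/s$/, "", $6); printf "%.1f", $4 / $6 }')

# Sum the number that precedes "fetched," on the two page lines.
fetched=$(printf '%s\n' "$stats" |
  awk '/^(Pages|Detail pages)/ { for (i = 2; i <= NF; i++) if ($i == "fetched,") n += $(i - 1) }
       END { print n }')

echo "rate: $rate/s"
echo "pages fetched: $fetched"
```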
Throttling and caps
These keep crawls polite and bounded (all set in the blueprint):
- crawl_config.delay_between_pages_ms / delay_between_items_ms: fixed delays.
- crawl_config.max_items: blueprint-level hard cap (0 = no cap). The per-run --limit flag takes precedence when non-zero.
- http_config.delay_ms: delay between page requests; for auction-style sites, 300–500 ms avoids rate limiting.
- auto_throttle: dynamically adjusts the inter-page delay based on actual server latency (Scrapy's AutoThrottle). See the Blueprint reference.
- cache: caches raw page HTML to disk so re-runs replay from cache instead of hitting the live site, ideal while iterating on selectors. See the Blueprint reference.
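The precedence between --limit and crawl_config.max_items can be sketched as a small function. `effective_limit` is a hypothetical helper written for illustration, not part of the package; it assumes that an omitted --limit behaves like `0`.

```shell
#!/bin/sh
# effective_limit LIMIT MAX_ITEMS
# Hypothetical helper: prints the cap a run would use, where 0 means "no cap".
effective_limit() {
  limit=$1
  max_items=$2
  if [ "$limit" -ne 0 ]; then
    echo "$limit"        # a non-zero --limit wins
  else
    echo "$max_items"    # otherwise fall back to the blueprint cap
  fi
}

a=$(effective_limit 20 100)   # --limit=20, max_items=100
b=$(effective_limit 0 100)    # no --limit,  max_items=100
c=$(effective_limit 0 0)      # neither set: no cap
echo "$a $b $c"
```

The three calls cover the cases described above: the flag overriding the blueprint, the blueprint cap applying on its own, and an unbounded crawl.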
Tail the application log
In the Docker stack, use the container name from docker ps (php_datahelm), not the image name:
docker exec -it php_datahelm tail -f /var/www/html/storage/logs/laravel.log

