Auto-detection
Detects the item list, pagination and field selectors (link, title, image, price, rating, address, description) from any listing page. Detection is deliberately generous, so you keep what you want and delete the rest.
Scrapy-style web crawling for Laravel
Point it at a URL, let it auto-detect the structure into an editable blueprint, then run that blueprint to get clean JSON. No per-site selector hand-coding.
Detection emits a plain, human-readable JSON blueprint. Fix any selector the heuristics got wrong, then run it. Nothing magic, nothing hidden.
Transports form a ladder from plain HTTP up to headless Chrome, FlareSolverr and managed APIs. The auto transport identifies the WAF vendor and escalates only as far as needed.
When a JavaScript site has no server-rendered HTML, API mode calls the JSON endpoint directly and reads fields by dot-path, which is faster and more reliable than rendering the page in a browser.
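To make "reads fields by dot-path" concrete, here is a minimal, language-neutral sketch in Python. It is not the package's implementation: the helper name `get_by_path` and the sample response are invented for illustration, and the exact path syntax the crawler accepts may differ.

```python
from typing import Any


def get_by_path(data: Any, path: str, default: Any = None) -> Any:
    """Walk nested dicts/lists following a dot-separated path (hypothetical helper)."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        elif isinstance(current, list) and key.isdigit() and int(key) < len(current):
            # Numeric segments index into lists, e.g. "items.0.title".
            current = current[int(key)]
        else:
            return default
    return current


# Invented JSON payload standing in for a site's API response.
response = {"data": {"items": [{"title": "Book A", "price": {"amount": 12.5}}]}}

title = get_by_path(response, "data.items.0.title")
amount = get_by_path(response, "data.items.0.price.amount")
missing = get_by_path(response, "data.items.3.title", default="n/a")
```

Returning a default instead of raising keeps one missing field from aborting the whole item, which is usually what a scraper wants.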
Handles "Load more" listings that return HTML fragments, replaying the endpoint with an incrementing offset and a scraped CSRF token.
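The replay loop described above can be sketched roughly as follows, in Python for illustration only. Everything here is an assumption: the `csrf-token` meta tag is a common convention rather than something the source confirms, and `fake_fetch` stands in for a real HTTP call so the example runs without a network.

```python
import re
from typing import Callable, Iterator, Optional


def extract_csrf_token(html: str) -> Optional[str]:
    """Read <meta name="csrf-token" content="..."> from a page (assumed convention)."""
    match = re.search(r'<meta\s+name="csrf-token"\s+content="([^"]+)"', html)
    return match.group(1) if match else None


def replay_load_more(
    fetch: Callable[[int, str], str],
    token: str,
    page_size: int,
    max_pages: int = 50,
) -> Iterator[str]:
    """Call the endpoint with offset 0, page_size, 2*page_size, ... until it returns nothing."""
    offset = 0
    for _ in range(max_pages):  # hard cap so a misbehaving endpoint cannot loop forever
        fragment = fetch(offset, token)
        if not fragment.strip():
            break
        yield fragment
        offset += page_size


# Stand-ins for a real listing page and its "Load more" endpoint.
PAGE = '<html><head><meta name="csrf-token" content="abc123"></head></html>'
ITEMS = [f"<li>item {i}</li>" for i in range(5)]


def fake_fetch(offset: int, token: str) -> str:
    if token != "abc123":
        return ""  # a real endpoint would typically reject the request
    return "".join(ITEMS[offset:offset + 2])


token = extract_csrf_token(PAGE)
fragments = list(replay_load_more(fake_fetch, token or "", page_size=2))
```

In a real crawl, `fetch` would POST the offset and token to the endpoint and return the HTML fragment, which is then parsed with the same field selectors as the first page.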
Resolve primary / gallery / all image URLs, download to any Laravel disk, deduplicate by key field, and keep only the items that pass your result filters.
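As a rough illustration of "deduplicate by key field, and keep only the items that pass your result filters", the Python sketch below uses invented names (`dedupe_and_filter`, the `url` key, the price filter) and makes two assumptions the source does not state: the first occurrence of a key wins, and items without the key are dropped.

```python
from typing import Any, Callable, Dict, Iterable, List

Item = Dict[str, Any]


def dedupe_and_filter(
    items: Iterable[Item],
    key_field: str,
    filters: List[Callable[[Item], bool]],
) -> List[Item]:
    """Keep the first item per key value, then keep only items passing every filter."""
    seen = set()
    kept = []
    for item in items:
        key = item.get(key_field)
        if key is None or key in seen:
            continue  # assumption: items without a key, and repeats, are skipped
        seen.add(key)
        if all(check(item) for check in filters):
            kept.append(item)
    return kept


# Invented sample items, as they might come out of a crawl.
items = [
    {"url": "/a", "title": "A", "price": 10.0},
    {"url": "/b", "title": "B", "price": 55.0},
    {"url": "/a", "title": "A (duplicate)", "price": 10.0},
    {"url": "/c", "title": "C", "price": 7.5},
]

result = dedupe_and_filter(items, key_field="url", filters=[lambda i: i["price"] < 50])
titles = [item["title"] for item in result]
```

Deduplicating before filtering means a rejected item still "claims" its key, so a later copy with the same key is not reconsidered; swap the two steps if the opposite behaviour is wanted.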
The Firecrawl / Crawl4AI feature, in Laravel: render any element as clean Markdown with a field type, or export the whole crawl as one Markdown document for RAG ingestion.
DataHelm Crawler is a generic, two-step scraping subsystem for Laravel, installable with composer require datahelm/crawler. Instead of hand-coding selectors per site, you:
1. Generate a blueprint (datahelm:scrap:generate): auto-detection writes editable JSON describing the list, pagination and fields.
2. Run the blueprint (datahelm:scrap:run, or a scaffolded datahelm:robot:{name}) to extract the items as JSON / JSONL / CSV / LLM-ready Markdown.
The default transport is plain HTTP (Guzzle) and needs no extra infrastructure. Optional Docker services (browserless, FlareSolverr) add JavaScript rendering and Cloudflare solving only when a site demands it.
# 1: auto-detect a listing and scaffold a robot
php artisan datahelm:scrap:generate "https://books.toscrape.com/" --get-detail=true --robot
# 2: run it, capped at 20 items
php artisan datahelm:robot:books --limit=20
This documentation is generated from the DataHelm Crawler project README and source. Start with Getting started or browse Core concepts.