What is DataHelm Crawler?
DataHelm Crawler is a generic, Scrapy-style web crawler delivered as a Laravel package (datahelm/crawler on Packagist). Rather than writing bespoke selectors for every site, you point it at a URL, let it auto-detect the page structure into an editable blueprint (plain JSON), then run that blueprint to get the scraped items as JSON, JSONL or CSV.
It is designed for listing/detail sites — product catalogs, auction platforms, real-estate portals, classifieds — where rows of items share a structure and may link to richer detail pages.
The two-step model
  datahelm:scrap:generate                datahelm:scrap:run
       (auto-detect)                         (extract)
URL ───────────────────► blueprint.json ───────────────────► items.json
                           (editable)                        (or .jsonl / .csv)
- Generate — fetch the URL, detect the item list, pagination and field selectors, and emit a blueprint as JSON. See Generating blueprints.
- Run — load the blueprint, follow pagination, extract each item (and its detail page when configured), and write the results. See Running scrapes.
The blueprint is just data. Open the JSON and fix any selector the heuristics got wrong before running it.
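Because the blueprint is plain JSON, correcting a selector is an ordinary data edit. The sketch below uses Python only for illustration (the package itself is PHP). The key names (`start_url`, `item_selector`, `fields`, `pagination`), the selectors, and the `fix_selector` helper are all hypothetical and do not reflect the real blueprint schema; check a generated blueprint for the actual structure.

```python
import json

# Hypothetical blueprint, roughly as a generate step might emit it.
# Every key name and selector here is an assumption for illustration only.
generated = """
{
  "start_url": "https://example.com/listings",
  "item_selector": "div.card",
  "fields": {
    "title": "h2.title",
    "price": "span.amount-old"
  },
  "pagination": "a.next"
}
"""


def fix_selector(blueprint: dict, field: str, selector: str) -> dict:
    """Return a copy of the blueprint with one field selector replaced."""
    fixed = json.loads(json.dumps(blueprint))  # deep copy via a JSON round-trip
    if field not in fixed["fields"]:
        raise KeyError(f"unknown field: {field}")
    fixed["fields"][field] = selector
    return fixed


blueprint = json.loads(generated)

# Suppose the heuristics picked the struck-through price; point at the live one.
edited = fix_selector(blueprint, "price", "span.amount")

# Serialize again so the edited blueprint can be saved and versioned.
edited_json = json.dumps(edited, indent=2)
```

The edit returns a copy and leaves the generated blueprint untouched, so the original can be kept for comparison.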
What makes it different
- Generous auto-detection. It suggests as many fields as it can find in both the list and the detail page, so you keep what you want and delete the rest.
- Editable, transparent blueprints. No hidden state — the blueprint is human-readable JSON you can hand-tune, version, or embed in a robot command.
- Transport ladder for bot protection. From plain HTTP up to headless Chrome, FlareSolverr and managed scraping APIs, with an auto mode that identifies the WAF vendor and escalates only as far as needed. See HTTP transports.
- Handles modern sites. API mode for JSON-backed SPAs and infinite scroll for "Load more" listings.
- Production touches. Deduplication, conditional result filters, image downloading to any Laravel disk, streaming output, HTTP caching, proxy rotation, and AutoThrottle.
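The escalation idea behind the transport ladder can be sketched as follows. This is a conceptual illustration in Python, not the package's implementation. The ladder order, the `looks_blocked` heuristic, the transport labels other than `guzzle`, and the stand-in fetchers are assumptions. The sketch also skips WAF vendor identification and simply moves up one rung whenever a response looks blocked.

```python
from typing import Callable, NamedTuple


class Response(NamedTuple):
    status: int
    body: str


# A transport is any callable that fetches a URL and returns a Response.
Transport = Callable[[str], Response]


def looks_blocked(resp: Response) -> bool:
    """Crude, assumed heuristic for a bot-protection challenge page."""
    if resp.status in (403, 429, 503):
        return True
    return "captcha" in resp.body.lower()


def fetch_with_escalation(
    url: str, ladder: list[tuple[str, Transport]]
) -> tuple[str, Response]:
    """Try transports cheapest-first; stop at the first unblocked response."""
    for name, transport in ladder:
        resp = transport(url)
        if not looks_blocked(resp):
            return name, resp
    raise RuntimeError(f"all transports were blocked for {url}")


# Stand-in transports so the sketch runs without network access.
def fake_http(url: str) -> Response:
    return Response(403, "Access denied")


def fake_headless(url: str) -> Response:
    return Response(200, "<html>listing</html>")


def fake_managed_api(url: str) -> Response:
    raise AssertionError("should not be reached once a cheaper rung succeeds")


ladder = [
    ("guzzle", fake_http),
    ("headless-chrome", fake_headless),
    ("managed-api", fake_managed_api),
]

used, response = fetch_with_escalation("https://example.com/listings", ladder)
```

Ordering the ladder from cheapest to most expensive means heavier transports are only paid for on sites that actually need them.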
Where it runs
The crawler is a Laravel library: composer require datahelm/crawler works on its own with the default guzzle transport. The reference project ships a full Docker stack (nginx + PHP 8.4-FPM + PostgreSQL + Redis + Supervisor + browserless + FlareSolverr) so you can run everything locally. See the Docker stack reference.
How the project is organized
DataHelm is split across three repositories:
| Repository | Visibility | Contents |
|---|---|---|
| datahelm/crawler | Public (Packagist) | The Laravel package — composer require datahelm/crawler |
| datahelm/environment | Public | Full Docker stack to run DataHelm locally |
| DataHelmCrawler | Private / demo | Development sandbox: Laravel app, site-specific robots, examples |
See Publishing the package for how the public package is synced out of the sandbox.
Next: Getting started →

