What is DataHelm Crawler?

DataHelm Crawler is a generic, Scrapy-style web crawler delivered as a Laravel package (datahelm/crawler on Packagist). Rather than writing bespoke selectors for every site, you point it at a URL, let it auto-detect the page structure and save it as an editable blueprint (plain JSON), then run that blueprint to get the scraped items as JSON, JSONL or CSV.

It is designed for listing/detail sites — product catalogs, auction platforms, real-estate portals, classifieds — where rows of items share a structure and may link to richer detail pages.

The two-step model

   datahelm:scrap:generate              datahelm:scrap:run
        (auto-detect)                       (extract)
 URL ───────────────────► blueprint.json ───────────────────► items.json
                          (editable)                           (or .jsonl / .csv)
  1. Generate — fetch the URL, detect the item list, pagination and field selectors, and emit a blueprint as JSON. See Generating blueprints.
  2. Run — load the blueprint, follow pagination, extract each item (and its detail page when configured), and write the results. See Running scrapes.
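
To make the three output formats concrete, here is a minimal Python sketch that serializes the same items as JSON, JSONL and CSV. It does not call the package; the item fields (title, price, url) and the helper names are assumptions made up for illustration, and the package's own writers may format output differently.

```python
import csv
import io
import json

# Hypothetical scraped items; the field names are illustrative only.
items = [
    {"title": "Oak desk", "price": "120.00", "url": "https://example.com/items/1"},
    {"title": "Desk lamp", "price": "18.50", "url": "https://example.com/items/2"},
]

def to_json(rows):
    """One JSON array holding every item."""
    return json.dumps(rows, indent=2)

def to_jsonl(rows):
    """One JSON object per line, which suits streaming and appending."""
    return "\n".join(json.dumps(row) for row in rows) + "\n"

def to_csv(rows):
    """Header row plus one line per item; assumes all items share the same keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

print(to_jsonl(items), end="")
```

JSONL is usually the easiest of the three to process incrementally, since each line can be parsed without loading the whole file.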

The blueprint is just data. Open the JSON and fix any selector the heuristics got wrong before running it.
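
Because a blueprint is plain JSON, a fix can also be scripted. The sketch below is an illustration in Python, not the package's API: the blueprint keys (start_url, item_selector, fields, selector) and the override_selector helper are hypothetical, and the schema emitted by the generate step may look different.

```python
import json

# Hypothetical blueprint; the real schema produced by the generate step may differ.
blueprint_text = """
{
  "start_url": "https://example.com/catalog",
  "item_selector": "div.product",
  "fields": {
    "title": {"selector": "h2 a"},
    "price": {"selector": "span.amount"}
  }
}
"""

def override_selector(blueprint, field, selector):
    """Return a copy of the blueprint with one field selector replaced."""
    updated = json.loads(json.dumps(blueprint))  # deep copy via JSON round-trip
    updated["fields"][field]["selector"] = selector
    return updated

blueprint = json.loads(blueprint_text)
# Suppose the heuristics picked the wrong price element; point it at the right one.
fixed = override_selector(blueprint, "price", "span.price-current")
print(json.dumps(fixed, indent=2))
```

Returning a copy keeps the generated blueprint untouched, so the original and the hand-tuned version can be diffed or versioned side by side.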

What makes it different

  • Generous auto-detection. It suggests as many fields as it can find in both the list and the detail page, so you keep what you want and delete the rest.
  • Editable, transparent blueprints. No hidden state — the blueprint is human-readable JSON you can hand-tune, version, or embed in a robot command.
  • Transport ladder for bot protection. From plain HTTP up to headless Chrome, FlareSolverr and managed scraping APIs, with an auto mode that identifies the WAF vendor and escalates only as far as needed. See HTTP transports.
  • Handles modern sites. API mode for JSON-backed SPAs and infinite scroll for "Load more" listings.
  • Production touches. Deduplication, conditional result filters, image downloading to any Laravel disk, streaming output, HTTP caching, proxy rotation, and AutoThrottle.
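
The idea of escalating "only as far as needed" can be sketched as a simple loop over transports ordered from cheapest to most expensive. This Python example is a simplified illustration under stated assumptions: the transport names and stand-in functions are hypothetical, no network requests are made, and it does not reproduce the WAF-vendor detection that the package's auto mode performs.

```python
from typing import Callable, List, Optional, Tuple

# A transport takes a URL and returns the page body, or None when blocked.
Transport = Callable[[str], Optional[str]]

def fetch_with_escalation(url: str, ladder: List[Tuple[str, Transport]]) -> Tuple[str, str]:
    """Try each transport in order and stop at the first one that succeeds."""
    for name, transport in ladder:
        body = transport(url)
        if body is not None:
            return name, body
    raise RuntimeError(f"all transports were blocked for {url}")

# Stand-in transports for the illustration; no network access happens here.
def plain_http(url: str) -> Optional[str]:
    return None  # pretend bot protection blocked the plain request

def headless_browser(url: str) -> Optional[str]:
    return "<html>listing</html>"

def managed_api(url: str) -> Optional[str]:
    return "<html>listing via API</html>"

ladder = [
    ("http", plain_http),
    ("browser", headless_browser),
    ("managed-api", managed_api),
]
used, html = fetch_with_escalation("https://example.com/catalog", ladder)
print(used)
```

Here the plain request is blocked, the browser step succeeds, and the managed API is never called, which is the cost-saving behavior the ladder is meant to provide.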

Where it runs

The crawler is a Laravel library: composer require datahelm/crawler works on its own with the default guzzle transport. The reference project ships a full Docker stack (nginx + PHP 8.4-FPM + PostgreSQL + Redis + Supervisor + browserless + FlareSolverr) so you can run everything locally. See the Docker stack reference.

How the project is organized

DataHelm is split across three repositories:

Repository             Visibility            Contents
datahelm/crawler       Public (Packagist)    The Laravel package: composer require datahelm/crawler
datahelm/environment   Public                Full Docker stack to run DataHelm locally
DataHelmCrawler        Private / demo        Development sandbox: Laravel app, site-specific robots, examples

See Publishing the package for how the public package is synced out of the sandbox.



Released under the MIT License.