
DataHelm Crawler

Scrapy-style web crawling for Laravel

Point it at a URL, let it auto-detect the structure into an editable blueprint, then run that blueprint to get clean JSON. No per-site selector hand-coding.

🧭

Auto-detection

Detects the item list, pagination and field selectors (link, title, image, price, rating, address, description) from any listing page. Detection is deliberately generous, so you keep what you want and delete the rest.

📋

Editable blueprints

Detection emits a plain, human-readable JSON blueprint. Fix any selector the heuristics got wrong, then run it. Nothing magic, nothing hidden.

🛡️

Bot-protection aware

Transports form a ladder from plain HTTP up to headless Chrome, FlareSolverr and managed APIs. The auto transport identifies the WAF vendor and escalates only as far as needed.
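The escalation idea can be sketched in a few lines. This is an illustrative Python sketch, not the package's PHP implementation: the transport names, the `looks_blocked` heuristic and the block markers are all assumptions, and the sketch does not attempt the WAF vendor identification described above.

```python
from typing import Callable, List, Tuple

# Hypothetical shape: a transport takes a URL and returns (status_code, body).
Transport = Callable[[str], Tuple[int, str]]

# Assumed signals that a response is a bot-protection page, not real content.
BLOCK_STATUSES = {403, 429, 503}
BLOCK_MARKERS = ("captcha", "just a moment")


def looks_blocked(status: int, body: str) -> bool:
    """Crude check for a challenge or block page."""
    lowered = body.lower()
    return status in BLOCK_STATUSES or any(m in lowered for m in BLOCK_MARKERS)


def fetch_with_escalation(url: str, ladder: List[Tuple[str, Transport]]) -> Tuple[str, str]:
    """Try each transport in order and stop at the first unblocked response."""
    for name, transport in ladder:
        status, body = transport(url)
        if not looks_blocked(status, body):
            return name, body
    raise RuntimeError(f"every transport was blocked for {url}")


# Stand-in transports so the sketch runs without network access.
def plain_http(url: str) -> Tuple[int, str]:
    return 403, "<title>Just a moment...</title>"


def headless_browser(url: str) -> Tuple[int, str]:
    return 200, "<ul><li>Item</li></ul>"


ladder = [("http", plain_http), ("browser", headless_browser)]
used, html = fetch_with_escalation("https://example.com/list", ladder)
print(used)
```

Ordering the ladder from cheapest to most expensive means a site that answers plain HTTP never pays for a browser render.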

⚙️

JSON APIs & SPAs

When a JavaScript site has no server HTML, API mode calls the JSON endpoint directly and reads fields by dot-path, which is faster and more reliable than rendering the page in a browser.
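As a rough illustration of dot-path reading, the Python sketch below walks a decoded JSON payload one path segment at a time. The function name `read_path`, the sample payload and the treatment of numeric segments as list indexes are assumptions for this example; the package's actual path syntax may differ.

```python
from typing import Any


def read_path(data: Any, path: str, default: Any = None) -> Any:
    """Follow a dot-separated path through nested dicts and lists."""
    current = data
    for part in path.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif isinstance(current, list) and part.isdecimal() and int(part) < len(current):
            # Numeric segments index into lists (an assumption of this sketch).
            current = current[int(part)]
        else:
            return default
    return current


# Hypothetical API response for a listing endpoint.
payload = {
    "data": {
        "items": [
            {"title": "A Light in the Attic", "price": {"amount": 51.77}},
        ]
    }
}

print(read_path(payload, "data.items.0.title"))
print(read_path(payload, "data.items.0.price.amount"))
```

Returning a default instead of raising keeps one missing field from aborting the whole item.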

♾️

Infinite scroll

Handles "Load more" listings that return HTML fragments, replaying the endpoint with an incrementing offset and a scraped CSRF token.
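The replay loop can be pictured as follows. This Python sketch makes several assumptions: the token sits in a `<meta name="csrf-token">` tag, the endpoint signals the end of the list with an empty fragment, and `fetch_fragment` is a hypothetical callable standing in for the real HTTP request.

```python
import re
from typing import Callable, Iterator, Optional


def scrape_csrf_token(html: str) -> Optional[str]:
    """Pull the token out of a <meta name="csrf-token"> tag, if present."""
    match = re.search(r'<meta\s+name="csrf-token"\s+content="([^"]+)"', html)
    return match.group(1) if match else None


def replay_load_more(
    fetch_fragment: Callable[[int, Optional[str]], str],
    token: Optional[str],
    page_size: int,
    max_requests: int = 50,
) -> Iterator[str]:
    """Call the endpoint with a growing offset until it returns nothing."""
    offset = 0
    for _ in range(max_requests):  # hard cap so a misbehaving endpoint cannot loop forever
        fragment = fetch_fragment(offset, token)
        if not fragment.strip():
            break
        yield fragment
        offset += page_size


# Stand-in endpoint so the sketch runs offline: offset -> HTML fragment.
FRAGMENTS = {0: "<li>a</li><li>b</li>", 2: "<li>c</li>"}
calls = []


def fake_fetch(offset: int, token: Optional[str]) -> str:
    calls.append((offset, token))
    return FRAGMENTS.get(offset, "")


page = '<meta name="csrf-token" content="abc123">'
token = scrape_csrf_token(page)
fragments = list(replay_load_more(fake_fetch, token, page_size=2))
print(len(fragments))
```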

🖼️

Images, dedup & filters

Resolve primary / gallery / all image URLs, download to any Laravel disk, deduplicate by key field, and keep only the items that pass your result filters.
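Deduplication by key field and result filtering can be sketched as two small passes over the extracted items. The Python below is illustrative only: the function names, the choice to keep items that lack the key, and the predicate-style filters are assumptions, not the package's blueprint syntax.

```python
from typing import Any, Callable, Dict, Iterable, List

Item = Dict[str, Any]


def dedupe_by_key(items: Iterable[Item], key: str) -> List[Item]:
    """Keep the first item seen for each value of `key`."""
    seen = set()
    unique: List[Item] = []
    for item in items:
        value = item.get(key)
        if value is not None:
            if value in seen:
                continue
            seen.add(value)
        # Items without the key are kept, since they cannot be compared.
        unique.append(item)
    return unique


def apply_filters(items: Iterable[Item], filters: List[Callable[[Item], bool]]) -> List[Item]:
    """Keep only the items that pass every filter."""
    return [item for item in items if all(check(item) for check in filters)]


# Hypothetical extracted items.
items = [
    {"link": "/a", "title": "A", "price": 12.0},
    {"link": "/b", "title": "B", "price": 55.0},
    {"link": "/a", "title": "A (duplicate)", "price": 12.0},
    {"title": "No link", "price": 5.0},
]

kept = apply_filters(dedupe_by_key(items, "link"), [lambda i: i["price"] < 50])
print([i["title"] for i in kept])
```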

📝

LLM-ready Markdown

The Firecrawl / Crawl4AI feature, in Laravel: render any element as clean Markdown with a field type, or export the whole crawl as one Markdown document for RAG ingestion.

What it is

DataHelm Crawler is a generic, two-step scraping subsystem for Laravel, installable with composer require datahelm/crawler. Instead of hand-coding selectors per site, you:

  1. Generate a blueprint from a URL (datahelm:scrap:generate): auto-detection writes editable JSON describing the list, pagination and fields.
  2. Run that blueprint (datahelm:scrap:run, or a scaffolded datahelm:robot:{name}) to extract the items as JSON / JSONL / CSV / LLM-ready Markdown.

The default transport is plain HTTP (guzzle) and needs no extra infrastructure. Optional Docker services (browserless, FlareSolverr) add JavaScript rendering and Cloudflare solving only when a site demands it.

bash
# 1: auto-detect a listing and scaffold a robot
php artisan datahelm:scrap:generate "https://books.toscrape.com/" --get-detail=true --robot

# 2: run it, capped at 20 items
php artisan datahelm:robot:books --limit=20

This documentation is generated from the DataHelm Crawler project README and source. Start with Getting started or browse Core concepts.

Released under the MIT License.