HTTP transports & bot protection
Sites differ wildly in how hard they are to fetch — from plain HTML to JavaScript SPAs to aggressive bot-protection WAFs (Cloudflare, Akamai, PerimeterX). The crawler abstracts this behind a transport, selected with:
- CRAWLER_TRANSPORT: the global default, set in .env;
- the --transport= flag, which is also baked into the blueprint;
- a blueprint's http_config.transport.
The transports
| Transport | What it does | Cost | Use for |
|---|---|---|---|
| auto (default) | Tries the cheapest transport first; on a detected block, identifies the WAF vendor and escalates to the appropriate stronger transport, then bakes the one that worked into the blueprint | free* | leave it on; most sites just work |
| guzzle | Plain HTTP client (no JS, no anti-bot) | free | normal sites, APIs |
| browser | Real headless Chrome via the browserless container (runs JS, real browser fingerprint) | free | JS/SPA sites, soft protections |
| flaresolverr | Stealth Chromium tuned to solve Cloudflare / DDoS-Guard challenges, via the FlareSolverr container | free | Cloudflare-protected sites |
| scraping_api | A managed provider (ZenRows/ScraperAPI/Zyte/…) that brings residential proxies + anti-bot solving | paid (your key) | the hardest WAFs (Akamai, PerimeterX) |
* auto only escalates to scraping_api when a key is configured; otherwise it stops with an honest message instead of producing a broken robot.
The free transports, in the simplest terms
| Transport | What it does | Speed | When you need it |
|---|---|---|---|
| guzzle | Plain HTTP request that just downloads the HTML | ⚡ Fastest, lightest | Normal sites where the data is already in the HTML |
| browser | Opens the page in a real headless Chrome and runs the JavaScript | 🐢 Slower (boots Chromium) | Sites where content only appears after JS runs (React/Vue/SPA), or soft bot-checks |
| flaresolverr | A stealth Chrome that specifically waits out Cloudflare "checking your browser…" pages | 🐢🐢 Slowest | Sites stuck behind a Cloudflare challenge screen |
Think of it as a ladder of effort: guzzle (cheap) → browser (heavier) → flaresolverr (heaviest). Each step up costs you speed and CPU, so you only climb when the cheaper one fails — which is exactly what auto does for you.
The escalation ladder (auto)
guzzle ─► browser ─► flaresolverr ─► scraping_api
fast      JS         Cloudflare      hardest WAFs (paid)
On a block, auto is vendor-aware: Cloudflare → flaresolverr; Akamai / PerimeterX / DataDome / Kasada → straight to scraping_api (it skips the transports it knows can't help). The winning transport is baked into the generated robot, so subsequent runs go straight to it — no re-escalation.
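The vendor-aware routing described above can be sketched as a small shell function. This is an illustration only, not the crawler's actual code: next_transport and its three arguments are hypothetical names, and two behaviours are assumptions rather than documented facts, namely that an unrecognized block climbs one rung at a time and that Cloudflare falls through to scraping_api once flaresolverr has failed.

```shell
# Hypothetical sketch of auto's escalation decision; not the package's real code.
# Usage: next_transport <vendor> <current-transport> <api-key-configured: yes|no>
# Prints the transport to try next, or "stop" when nothing further can help.
next_transport() {
  vendor="$1"; current="$2"; has_key="$3"
  next=stop

  case "$vendor" in
    akamai|perimeterx|datadome|kasada)
      # Skip the free transports: they are known not to help with these vendors.
      if [ "$current" != scraping_api ]; then next=scraping_api; fi ;;
    cloudflare)
      case "$current" in
        guzzle|browser) next=flaresolverr ;;
        flaresolverr)   next=scraping_api ;;  # assumption: last resort
      esac ;;
    *)
      # Assumption: an unrecognized block climbs the ladder one rung at a time.
      case "$current" in
        guzzle)       next=browser ;;
        browser)      next=flaresolverr ;;
        flaresolverr) next=scraping_api ;;
      esac ;;
  esac

  # scraping_api is only reachable when a key is configured.
  if [ "$next" = scraping_api ] && [ "$has_key" != yes ]; then
    next=stop
  fi
  echo "$next"
}
```

For example, `next_transport akamai guzzle no` prints `stop`, which corresponds to the case where auto reports the block instead of generating a robot that cannot work.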
Bot-protection detection
Every transport's response is inspected for WAF fingerprints (in headers and body, including challenge pages served with HTTP 200). When blocked, you get a clear message naming the vendor instead of a raw HTML dump, e.g.:
Blocked by Akamai Bot Manager (HTTP 403) — couldn't read https://…
… Even the headless browser was blocked — this firewall also scores IP
reputation, so a datacenter IP is flagged regardless of the browser. Options:
route through residential proxies; or capture the JSON API from your browser's
Network tab and re-run with --api-endpoint=<url>.
Detected vendors: Akamai, Cloudflare, PerimeterX/HUMAN, DataDome, Imperva/Incapsula, AWS WAF, Sucuri, Kasada.
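As a rough illustration of fingerprint-based detection, the sketch below classifies a raw HTTP response by looking for a few well-known markers. The function name detect_waf and the choice of markers are assumptions made for this example; the crawler's own rule set is not shown here and is likely broader. Note that a cookie marker only suggests that a vendor sits in front of the site, so a real detector would also weigh status codes and challenge-page content.

```shell
# Illustrative only: the marker strings below are assumptions for this example,
# not the crawler's actual detection rules.
# Reads a raw HTTP response (headers, blank line, body) on stdin and prints
# a vendor name, or "none" when no marker matches.
detect_waf() {
  # Lowercase everything so header names match regardless of case.
  response=$(tr '[:upper:]' '[:lower:]')

  case "$response" in
    *"cf-mitigated: challenge"*) echo cloudflare ;;  # Cloudflare challenge header
    *_px3=*|*_pxhd=*)            echo perimeterx ;;  # PerimeterX/HUMAN cookies
    *_abck=*)                    echo akamai ;;      # Akamai Bot Manager cookie
    *x-datadome*|*datadome=*)    echo datadome ;;    # DataDome header or cookie
    *)                           echo none ;;
  esac
}
```

One way to try it against a live page is `curl -si "<url>" | detect_waf`, where `-i` makes curl include the response headers in its output.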
Self-hosted transports (browserless & FlareSolverr)
Both are optional, free Docker services. You do not need them to install the package; you need them only when you use the browser, flaresolverr, or auto transports.
Reference demo repo (docker-compose.yml):
docker compose up -d browserless flaresolverr # start when scraping protected sites
docker compose stop browserless flaresolverr # free RAM/CPU when done (they run a full Chromium)
Package users (minimal services only):
docker compose -f vendor/datahelm/crawler/docker/compose.services.yml up -d
# or, from a clone of datahelm/crawler:
docker compose -f docker/compose.services.yml up -d
Full stack — use datahelm/environment.
Config lives in config/crawler.php (browser, flaresolverr) with env overrides in .env (BROWSERLESS_URL, FLARESOLVERR_URL, FLARESOLVERR_MAX_TIMEOUT). See the Environment variables reference.
Which transport am I using right now?
Whatever CRAWLER_TRANSPORT is set to, unless a robot was generated with its own --transport; in that case the baked-in value overrides the global default for that robot. guzzle, browser, and flaresolverr are all free; only scraping_api costs money.
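The lookup order can be pictured with a short sketch. The function effective_transport is a hypothetical name used only for illustration; it assumes the robot's baked-in transport is passed as an argument (empty when the robot has none) and that the global default falls back to auto when CRAWLER_TRANSPORT is unset.

```shell
# Hypothetical sketch of the lookup order; not the package's real code.
# Usage: effective_transport <transport baked into the robot, or empty>
effective_transport() {
  baked="$1"
  if [ -n "$baked" ]; then
    echo "$baked"                      # the robot's own setting wins
  else
    echo "${CRAWLER_TRANSPORT:-auto}"  # global default; auto when unset
  fi
}
```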
Managed scraping API (scraping_api)
Vendor-agnostic — works with any provider following the GET <service>?url=&key=&flags convention. Configure in .env:
CRAWLER_TRANSPORT=scraping_api
SCRAPING_API_URL=https://api.zenrows.com/v1/
SCRAPING_API_KEY=your_key
SCRAPING_API_KEY_PARAM=apikey
SCRAPING_API_PARAMS=js_render=true&antibot=true&premium_proxy=true&proxy_country=br
See config/crawler.php for ready-made examples (ZenRows, ScraperAPI, ScrapingBee).
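To make the convention concrete, the request implied by such a configuration can be approximated with curl. This is a sketch under assumptions: scraping_api_fetch is a hypothetical helper and not part of the crawler, it assumes the provider expects the target in a url parameter as in the convention above, and the CURL variable is an illustrative override that lets you print the arguments instead of sending the request.

```shell
# Hypothetical helper: fetch a page through a provider that follows the
# GET <service>?url=&key=&flags convention. Not part of the crawler itself.
# Expects SCRAPING_API_URL, SCRAPING_API_KEY, SCRAPING_API_KEY_PARAM and
# SCRAPING_API_PARAMS to be set in the current shell.
scraping_api_fetch() {
  target="$1"
  # -G moves all data into the query string of a GET request;
  # --data-urlencode escapes the value part; -d appends the flags as written.
  "${CURL:-curl}" -sG "$SCRAPING_API_URL" \
    --data-urlencode "url=$target" \
    --data-urlencode "$SCRAPING_API_KEY_PARAM=$SCRAPING_API_KEY" \
    -d "$SCRAPING_API_PARAMS"
}
```

Set the four variables in the shell by hand rather than sourcing .env directly: the unquoted `&` characters in SCRAPING_API_PARAMS would be interpreted by the shell.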
Free fallback for hard WAFs: replay your own session
When no free transport beats a WAF (Akamai, PerimeterX) and you don't want a paid key, you can reuse the session your own browser already passed — for a one-off run (cookies expire in hours):
- Open the page in your browser; in DevTools, copy the relevant cookies (_px3, _abck, session…) and key headers (User-Agent, Accept-Language).
- Generate with guzzle (exact replay) plus the captured values:
php artisan datahelm:scrap:generate "<url>" \
--transport=guzzle \
--cookie="_px3=…; _pxhd=…" \
--header="user-agent: Mozilla/5.0 …" \
--header="accept-language: pt-BR,pt;q=0.9" \
--robot-name=example
The cookies/headers are baked into the blueprint and sent on every request.

