HTTP transports & bot protection

Sites differ wildly in how hard they are to fetch — from plain HTML to JavaScript SPAs to aggressive bot-protection WAFs (Cloudflare, Akamai, PerimeterX). The crawler abstracts this behind a transport, selected with:

- CRAWLER_TRANSPORT: the global default in .env;
- the --transport= flag, which is also baked into the blueprint;
- a blueprint's http_config.transport.
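
As a rough illustration of how these settings might combine, the sketch below resolves the transport for one robot. It assumes the blueprint is available as a plain dictionary and that a baked `http_config.transport` wins over the global default, with `auto` as the final fallback; `resolve_transport` is a hypothetical helper, not part of the package.

```python
import os
from typing import Mapping, Optional

VALID_TRANSPORTS = {"auto", "guzzle", "browser", "flaresolverr", "scraping_api"}


def resolve_transport(blueprint: Mapping, env: Optional[Mapping] = None) -> str:
    """Pick the transport for one robot (hypothetical helper, assumed precedence)."""
    env = os.environ if env is None else env
    # A transport baked into the blueprint applies to that robot only.
    baked = (blueprint.get("http_config") or {}).get("transport")
    # Otherwise use the global default from .env, then fall back to "auto".
    chosen = baked or env.get("CRAWLER_TRANSPORT") or "auto"
    if chosen not in VALID_TRANSPORTS:
        raise ValueError(f"unknown transport: {chosen!r}")
    return chosen
```

For example, `resolve_transport({"http_config": {"transport": "browser"}}, {"CRAWLER_TRANSPORT": "guzzle"})` yields `browser`, because the baked value takes priority for that robot.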

The transports

| Transport | What it does | Cost | Use for |
| --- | --- | --- | --- |
| `auto` (default) | Tries the cheapest transport and, on a detected block, identifies the WAF vendor and escalates to the right stronger transport, then bakes the one that worked into the blueprint | free* | leave it on; most sites just work |
| `guzzle` | Plain HTTP client (no JS, no anti-bot) | free | normal sites, APIs |
| `browser` | Real headless Chrome via the browserless container (runs JS, real browser fingerprint) | free | JS/SPA sites, soft protections |
| `flaresolverr` | Stealth Chromium tuned to solve Cloudflare / DDoS-Guard challenges, via the FlareSolverr container | free | Cloudflare-protected sites |
| `scraping_api` | A managed provider (ZenRows/ScraperAPI/Zyte/…) that brings residential proxies + anti-bot solving | paid (your key) | the hardest WAFs (Akamai, PerimeterX) |

* auto escalates to scraping_api only when a key is configured; otherwise it stops with a clear message instead of producing a broken robot.

The free transports, in the simplest terms

| Transport | What it does | Speed | When you need it |
| --- | --- | --- | --- |
| `guzzle` | Plain HTTP request; just downloads the HTML | ⚡ Fastest, lightest | Normal sites where the data is already in the HTML |
| `browser` | Opens the page in a real headless Chrome, runs the JavaScript | 🐢 Slower (boots Chromium) | Sites where content only appears after JS runs (React/Vue/SPA), or soft bot-checks |
| `flaresolverr` | A stealth Chrome that specifically waits out Cloudflare "checking your browser…" pages | 🐢🐢 Slowest | Sites stuck behind a Cloudflare challenge screen |

Think of it as a ladder of effort: guzzle (cheap) → browser (heavier) → flaresolverr (heaviest). Each step up costs you speed and CPU, so you only climb when the cheaper one fails — which is exactly what auto does for you.

The escalation ladder (`auto`)

guzzle ─► browser ─► flaresolverr ─► scraping_api
 fast       JS         Cloudflare       hardest WAFs (paid)

On a block, auto is vendor-aware: Cloudflare → flaresolverr; Akamai / PerimeterX / DataDome / Kasada → straight to scraping_api (it skips the transports it knows can't help). The winning transport is baked into the generated robot, so subsequent runs go straight to it — no re-escalation.
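
The vendor-aware escalation described above can be sketched as a small loop. This is a minimal illustration, not the package's implementation: `escalate`, the `fetch` callback, and the lowercase vendor labels are hypothetical, and the behavior for unidentified blocks (simply climbing one rung) is an assumption.

```python
from typing import Callable, Optional

LADDER = ["guzzle", "browser", "flaresolverr", "scraping_api"]

# Vendor shortcuts taken from the paragraph above; labels are illustrative.
VENDOR_JUMP = {
    "cloudflare": "flaresolverr",
    "akamai": "scraping_api",
    "perimeterx": "scraping_api",
    "datadome": "scraping_api",
    "kasada": "scraping_api",
}


def escalate(fetch: Callable[[str], Optional[str]], has_api_key: bool) -> str:
    """Return the first transport that works.

    fetch(transport) returns None on success, or a vendor label
    (or any other string for an unidentified block) when blocked.
    """
    i = 0
    while i < len(LADDER):
        transport = LADDER[i]
        if transport == "scraping_api" and not has_api_key:
            # Stop with a clear error instead of baking a broken robot.
            raise RuntimeError("blocked, and no scraping API key is configured")
        vendor = fetch(transport)
        if vendor is None:
            return transport  # the winner would be baked into the blueprint
        target = VENDOR_JUMP.get(vendor)
        if target is not None and LADDER.index(target) > i:
            i = LADDER.index(target)  # skip rungs known not to help
        else:
            i += 1  # assumed fallback: try the next rung
    raise RuntimeError("every transport was blocked")
```

With a site that reports `akamai` on every free transport, the loop tries `guzzle`, then jumps directly to `scraping_api`, skipping `browser` and `flaresolverr`.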

Bot-protection detection

Every transport's response is inspected for WAF fingerprints (in headers and body, including challenge pages served with HTTP 200). When blocked, you get a clear message naming the vendor instead of a raw HTML dump, e.g.:

Blocked by Akamai Bot Manager (HTTP 403) — couldn't read https://…
  … Even the headless browser was blocked — this firewall also scores IP
  reputation, so a datacenter IP is flagged regardless of the browser. Options:
  route through residential proxies; or capture the JSON API from your browser's
  Network tab and re-run with --api-endpoint=<url>.

Detected vendors: Akamai, Cloudflare, PerimeterX/HUMAN, DataDome, Imperva/Incapsula, AWS WAF, Sucuri, Kasada.
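
To make the idea of fingerprinting concrete, here is a small sketch of header-and-body inspection. The markers below are commonly seen examples chosen for illustration and are assumptions; they are not the crawler's actual rule set, and `detect_waf` is a hypothetical function. A real detector would need more careful rules to avoid false positives.

```python
from typing import List, Mapping, Optional, Tuple

BLOCK_STATUSES = {403, 429, 503}

# (vendor, header markers, body markers), all lowercase. Illustrative only.
RULES: List[Tuple[str, List[str], List[str]]] = [
    ("Cloudflare", ["cf-mitigated: challenge", "server: cloudflare"], ["just a moment..."]),
    ("Akamai", ["_abck=", "server: akamaighost"], []),
    ("PerimeterX", ["_px3=", "_pxhd="], ["px-captcha"]),
    ("DataDome", ["x-datadome", "datadome="], []),
]


def detect_waf(status: int, headers: Mapping[str, str], body: str) -> Optional[str]:
    """Return the vendor name if the response looks like a block, else None."""
    header_text = "\n".join(f"{k}: {v}" for k, v in headers.items()).lower()
    body_text = body.lower()
    for vendor, header_marks, body_marks in RULES:
        in_headers = any(mark in header_text for mark in header_marks)
        in_body = any(mark in body_text for mark in body_marks)
        # A challenge body counts even with HTTP 200; a header marker alone
        # counts only when the status code also looks like a block.
        if in_body or (in_headers and status in BLOCK_STATUSES):
            return vendor
    return None
```

Treating header markers and body markers differently is what lets the sketch flag a challenge page served with HTTP 200 without flagging every ordinary page that merely sits behind the same CDN.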

Self-hosted transports (browserless & FlareSolverr)

Both are optional, free Docker services. You don't need them to install the package; they are required only when you use the browser, flaresolverr, or auto transports.

Reference demo repo (docker-compose.yml):

```bash
docker compose up -d browserless flaresolverr   # start when scraping protected sites
docker compose stop browserless flaresolverr    # free RAM/CPU when done (they run a full Chromium)
```

Package users (minimal services only):

```bash
docker compose -f vendor/datahelm/crawler/docker/compose.services.yml up -d
# or, from a clone of datahelm/crawler:
docker compose -f docker/compose.services.yml up -d
```

For the full stack, use datahelm/environment.

Config lives in config/crawler.php (browser, flaresolverr) with env overrides in .env (BROWSERLESS_URL, FLARESOLVERR_URL, FLARESOLVERR_MAX_TIMEOUT). See the Environment variables reference.

Which transport am I using right now?

Whatever CRAWLER_TRANSPORT is set to, unless a robot baked its own --transport, which overrides the global default for that robot. guzzle, browser, and flaresolverr are all free; only scraping_api costs money.

Managed scraping API (`scraping_api`)

Vendor-agnostic — works with any provider following the GET <service>?url=&key=&flags convention. Configure in .env:

```ini
CRAWLER_TRANSPORT=scraping_api
SCRAPING_API_URL=https://api.zenrows.com/v1/
SCRAPING_API_KEY=your_key
SCRAPING_API_KEY_PARAM=apikey
SCRAPING_API_PARAMS=js_render=true&antibot=true&premium_proxy=true&proxy_country=br
```

See config/crawler.php for ready-made examples (ZenRows, ScraperAPI, ScrapingBee).
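
To show how these settings could map onto the `GET <service>?url=&key=&flags` convention, the sketch below assembles a request URL from them. It is an assumption about how the pieces fit together, not the package's code; `build_scraping_api_url` is a hypothetical helper and the settings are passed in as a plain dictionary.

```python
from typing import Mapping
from urllib.parse import parse_qsl, urlencode


def build_scraping_api_url(target_url: str, env: Mapping[str, str]) -> str:
    """Compose the provider URL from the .env-style settings (illustrative)."""
    params = [
        ("url", target_url),  # the page to fetch
        (env["SCRAPING_API_KEY_PARAM"], env["SCRAPING_API_KEY"]),  # provider-specific key name
    ]
    # Extra provider flags are stored as a ready-made query string.
    params += parse_qsl(env.get("SCRAPING_API_PARAMS", ""))
    base = env["SCRAPING_API_URL"]
    separator = "&" if "?" in base else "?"
    return base + separator + urlencode(params)
```

Because `urlencode` percent-encodes the target URL, query strings in the page address do not collide with the provider's own parameters.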

Free fallback for hard WAFs: replay your own session

When no free transport beats a WAF (Akamai, PerimeterX) and you don't want a paid key, you can reuse the session your own browser already passed. This works for a one-off run only, because the cookies expire within hours:

1. Open the page in your browser; in DevTools, copy the relevant cookies (_px3, _abck, session…) and key headers (User-Agent, Accept-Language).
2. Generate with guzzle (exact replay) and the captured values:

```bash
php artisan datahelm:scrap:generate "<url>" \
  --transport=guzzle \
  --cookie="_px3=…; _pxhd=…" \
  --header="user-agent: Mozilla/5.0 …" \
  --header="accept-language: pt-BR,pt;q=0.9" \
  --robot-name=example
```

The cookies/headers are baked into the blueprint and sent on every request.
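
Conceptually, the captured values end up as one set of request headers that is replayed unchanged on each request. The sketch below illustrates that mapping under the assumption that each `--header` value has the form `name: value`; `build_replay_headers` is a hypothetical helper, not the package's code.

```python
from typing import Dict, List


def build_replay_headers(cookie: str, header_flags: List[str]) -> Dict[str, str]:
    """Merge a --cookie string and --header values into one header map."""
    headers: Dict[str, str] = {}
    for flag in header_flags:
        # Split on the first colon only, so values may contain colons.
        name, sep, value = flag.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"expected 'name: value', got {flag!r}")
        headers[name.strip().lower()] = value.strip()
    if cookie:
        # The cookie string is sent verbatim as a single Cookie header.
        headers["cookie"] = cookie.strip()
    return headers
```

Keeping the values byte-for-byte as captured matters here, since the session cookies were issued to a browser that presented those exact headers.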

