
HTTP transports & bot protection

Sites differ wildly in how hard they are to fetch — from plain HTML to JavaScript SPAs to aggressive bot-protection WAFs (Cloudflare, Akamai, PerimeterX). The crawler abstracts this behind a transport, selected with:

  • CRAWLER_TRANSPORT — global default in .env;
  • the --transport= flag, whose value is also baked into the generated blueprint;
  • a blueprint's http_config.transport.

The transports

| Transport | What it does | Cost | Use for |
| --- | --- | --- | --- |
| auto (default) | Tries the cheapest transport and, on a detected block, identifies the WAF vendor and escalates to the right stronger transport, then bakes the one that worked into the blueprint | free* | leave it on; most sites just work |
| guzzle | Plain HTTP client (no JS, no anti-bot) | free | normal sites, APIs |
| browser | Real headless Chrome via the browserless container (runs JS, real browser fingerprint) | free | JS/SPA sites, soft protections |
| flaresolverr | Stealth Chromium tuned to solve Cloudflare / DDoS-Guard challenges, via the FlareSolverr container | free | Cloudflare-protected sites |
| scraping_api | A managed provider (ZenRows/ScraperAPI/Zyte/…) that brings residential proxies + anti-bot solving | paid (your key) | the hardest WAFs (Akamai, PerimeterX) |

\* auto escalates to scraping_api only when a key is configured; otherwise it stops with a clear message rather than producing a broken robot.

The free transports, in the simplest terms

| Transport | What it does | Speed | When you need it |
| --- | --- | --- | --- |
| guzzle | Plain HTTP request that just downloads the HTML | ⚡ Fastest, lightest | Normal sites where the data is already in the HTML |
| browser | Opens the page in a real headless Chrome, runs the JavaScript | 🐢 Slower (boots Chromium) | Sites where content only appears after JS runs (React/Vue/SPA), or soft bot-checks |
| flaresolverr | A stealth Chrome that specifically waits out Cloudflare "checking your browser…" pages | 🐢🐢 Slowest | Sites stuck behind a Cloudflare challenge screen |

Think of it as a ladder of effort: guzzle (cheap) → browser (heavier) → flaresolverr (heaviest). Each step up costs you speed and CPU, so you only climb when the cheaper one fails — which is exactly what auto does for you.

The escalation ladder (auto)

guzzle ─► browser ─► flaresolverr ─► scraping_api
 fast       JS         Cloudflare       hardest WAFs (paid)

On a block, auto is vendor-aware: Cloudflare → flaresolverr; Akamai / PerimeterX / DataDome / Kasada → straight to scraping_api (it skips the transports it knows can't help). The winning transport is baked into the generated robot, so subsequent runs go straight to it — no re-escalation.
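
The vendor-aware escalation described above can be sketched in a few lines of Python. This is only an illustration of the decision rule, not the package's implementation: the names `LADDER`, `VENDOR_SHORTCUT`, and `next_transport` are hypothetical, and climbing a single rung for an unrecognized vendor is an assumption.

```python
from typing import Optional

# Ladder order as described in the text, cheapest first.
LADDER = ["guzzle", "browser", "flaresolverr", "scraping_api"]

# Vendor shortcuts named in the text (the keys are hypothetical identifiers).
VENDOR_SHORTCUT = {
    "cloudflare": "flaresolverr",
    "akamai": "scraping_api",
    "perimeterx": "scraping_api",
    "datadome": "scraping_api",
    "kasada": "scraping_api",
}


def next_transport(current: str, vendor: Optional[str], has_api_key: bool) -> Optional[str]:
    """Return the transport to try after `current` was blocked, or None to stop."""
    position = LADDER.index(current)
    target = VENDOR_SHORTCUT.get(vendor or "")
    if target is None or LADDER.index(target) <= position:
        # Assumption: with an unknown vendor, or a shortcut that was already
        # tried, climb exactly one rung.
        if position + 1 >= len(LADDER):
            return None
        target = LADDER[position + 1]
    if target == "scraping_api" and not has_api_key:
        # Without a configured key there is nothing stronger left to try.
        return None
    return target
```

For example, `next_transport("guzzle", "akamai", has_api_key=False)` returns `None`, which corresponds to the case where auto stops with a message instead of generating a robot that cannot work.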

Bot-protection detection

Every transport's response is inspected for WAF fingerprints (in headers and body, including challenge pages served with HTTP 200). When blocked, you get a clear message naming the vendor instead of a raw HTML dump, e.g.:

Blocked by Akamai Bot Manager (HTTP 403) — couldn't read https://…
  … Even the headless browser was blocked — this firewall also scores IP
  reputation, so a datacenter IP is flagged regardless of the browser. Options:
  route through residential proxies; or capture the JSON API from your browser's
  Network tab and re-run with --api-endpoint=<url>.

Detected vendors: Akamai, Cloudflare, PerimeterX/HUMAN, DataDome, Imperva/Incapsula, AWS WAF, Sucuri, Kasada.
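
As a rough illustration of what fingerprint-based detection involves, the sketch below checks a response for a handful of publicly known markers. The marker tables and the `detect_block` helper are assumptions made for this example; the signatures the crawler actually uses are not documented here and are likely more extensive.

```python
from typing import Mapping, Optional

# A few commonly cited response-header fingerprints (illustrative subset only).
HEADER_MARKERS = {
    "x-datadome": "DataDome",
    "x-sucuri-id": "Sucuri",
    "x-iinfo": "Imperva/Incapsula",
    "cf-ray": "Cloudflare",
}

# Body fingerprints: the title of Cloudflare's interstitial challenge page.
BODY_MARKERS = {
    "just a moment...": "Cloudflare",
}

# Statuses treated as a likely block in this sketch.
BLOCK_STATUSES = {403, 429, 503}


def detect_block(status: int, headers: Mapping[str, str], body: str) -> Optional[str]:
    """Return the vendor name if the response looks like a WAF block, else None."""
    header_names = {name.lower() for name in headers}
    body_lower = body.lower()

    # Challenge pages may be served with HTTP 200, so the body is always checked.
    for marker, vendor in BODY_MARKERS.items():
        if marker in body_lower:
            return vendor

    # Vendor headers also appear on normal responses, so they only count
    # when the status itself signals a block.
    if status in BLOCK_STATUSES:
        for name, vendor in HEADER_MARKERS.items():
            if name in header_names:
                return vendor
    return None
```

Checking the body independently of the status code is what lets a challenge page served with HTTP 200 be recognized as a block rather than parsed as content.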

Self-hosted transports (browserless & FlareSolverr)

Both are optional, free Docker services. You don't need them to install the package; they are required only when you use the browser, flaresolverr, or auto transports.

Reference demo repo (docker-compose.yml):

```bash
docker compose up -d browserless flaresolverr   # start when scraping protected sites
docker compose stop browserless flaresolverr    # free RAM/CPU when done (they run a full Chromium)
```

Package users (minimal services only):

```bash
docker compose -f vendor/datahelm/crawler/docker/compose.services.yml up -d
# or, from a clone of datahelm/crawler:
docker compose -f docker/compose.services.yml up -d
```

For the full stack, use datahelm/environment.

Config lives in config/crawler.php (browser, flaresolverr) with env overrides in .env (BROWSERLESS_URL, FLARESOLVERR_URL, FLARESOLVERR_MAX_TIMEOUT). See the Environment variables reference.

Which transport am I using right now?

By default, whatever CRAWLER_TRANSPORT is set to. A robot that baked in its own --transport overrides that value for that robot only. guzzle, browser, and flaresolverr are all free; only scraping_api costs money.
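
The resolution order can be pictured with a small Python sketch. The function name `effective_transport` and its signature are hypothetical; the only rules it encodes are the ones stated on this page: a transport baked into a robot wins for that robot, otherwise CRAWLER_TRANSPORT applies, and auto is the default.

```python
import os
from typing import Mapping, Optional


def effective_transport(
    blueprint_transport: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Pick the transport a robot would use under the rules described above."""
    # A transport baked into the robot's blueprint overrides the global setting.
    if blueprint_transport:
        return blueprint_transport
    # Otherwise fall back to the global default, then to "auto".
    return env.get("CRAWLER_TRANSPORT") or "auto"
```

With `CRAWLER_TRANSPORT=guzzle` in the environment, a robot baked with `browser` still resolves to `browser`, while a robot with no baked transport resolves to `guzzle`.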

Managed scraping API (scraping_api)

This transport is vendor-agnostic: it works with any provider that follows the GET <service>?url=&key=&flags convention. Configure it in .env:

```ini
CRAWLER_TRANSPORT=scraping_api
SCRAPING_API_URL=https://api.zenrows.com/v1/
SCRAPING_API_KEY=your_key
SCRAPING_API_KEY_PARAM=apikey
SCRAPING_API_PARAMS=js_render=true&antibot=true&premium_proxy=true&proxy_country=br
```

See config/crawler.php for ready-made examples (ZenRows, ScraperAPI, ScrapingBee).
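
To make the GET <service>?url=&key=&flags convention concrete, here is a minimal Python sketch of how such a request URL could be assembled from the settings above. The helper `build_scraping_url` is hypothetical and only mirrors the convention; it is not the crawler's own code, and the target URL is a placeholder.

```python
from urllib.parse import parse_qsl, urlencode


def build_scraping_url(
    service_url: str,
    key_param: str,
    api_key: str,
    target_url: str,
    extra_params: str = "",
) -> str:
    """Assemble <service>?url=<target>&<key_param>=<key>&<flags>."""
    query = [("url", target_url), (key_param, api_key)]
    # extra_params uses the same "a=1&b=2" form as SCRAPING_API_PARAMS.
    query += parse_qsl(extra_params)
    # urlencode percent-encodes the target URL so it survives as one parameter.
    return f"{service_url}?{urlencode(query)}"


url = build_scraping_url(
    "https://api.zenrows.com/v1/",
    "apikey",
    "your_key",
    "https://example.com/",
    "js_render=true&antibot=true",
)
```

Keeping the key parameter name configurable is what makes the same transport usable across providers that name it differently.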

Free fallback for hard WAFs: replay your own session

When no free transport beats a WAF (Akamai, PerimeterX) and you don't want a paid key, you can reuse the session your own browser has already established. This suits only a one-off run, because the cookies expire within hours:

  1. Open the page in your browser; in DevTools copy the relevant cookies (_px3, _abck, session…) and key headers (User-Agent, Accept-Language).
  2. Generate with guzzle (exact replay) + the captured values:
```bash
php artisan datahelm:scrap:generate "<url>" \
  --transport=guzzle \
  --cookie="_px3=…; _pxhd=…" \
  --header="user-agent: Mozilla/5.0 …" \
  --header="accept-language: pt-BR,pt;q=0.9" \
  --robot-name=example
```

The cookies/headers are baked into the blueprint and sent on every request.
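
Conceptually, replaying a session just means attaching the same cookies and headers to each outgoing request. The Python sketch below builds such a request with the standard library without sending it; `build_replay_request` is a hypothetical helper, and the cookie and header values are placeholders, not real session data.

```python
from typing import Mapping
from urllib.request import Request


def build_replay_request(url: str, cookie: str, headers: Mapping[str, str]) -> Request:
    """Build a request that carries a captured browser session."""
    request = Request(url, headers=dict(headers))
    # The captured cookies travel as a single Cookie header.
    request.add_header("Cookie", cookie)
    return request


replay = build_replay_request(
    "https://example.com/",
    "_px3=placeholder",
    {"user-agent": "Mozilla/5.0", "accept-language": "pt-BR,pt;q=0.9"},
)
```

Because the WAF ties the session to the browser that earned it, the User-Agent and Accept-Language should match the captured browser exactly, which is why they are replayed together with the cookies.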



Released under the MIT License.