Architecture
The crawler is built from small, single-responsibility pieces behind interfaces. They are wired in the package's service provider (CrawlerServiceProvider) and can be overridden in your app's AppServiceProvider. The four seams are transport, detection, processing pipeline, and output.
The four seams
| Seam | Interface | Role |
|---|---|---|
| Transport | HttpClient | Fetches a URL — guzzle, browser, flaresolverr, scraping_api, auto, plus caching/auto wrappers |
| Detection | *Detector | Heuristics that find the list, pagination and each field type |
| Pipeline | ItemProcessor | Post-processes each scraped item (trim, absolute URLs, schema coercion) |
| Output | ItemExporter / OutputSink | Serializes (json/jsonl/csv) and writes (file, database, queue, webhook, callback) |
Because each seam is an interface, swapping behavior (for example, adding a new headless transport) means binding one class, with no other code changes.
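
As a rough illustration of what "binding one class" means, the sketch below uses a stand-in HttpClient interface with a single get() method. The method name and signature, FixtureHttpClient, and pageTitle() are assumptions made for this example, not the package's actual contract.

```php
<?php
declare(strict_types=1);

// Stand-in for the package's HttpClient contract. The real interface lives
// under DataHelm\Crawler\ and its method signature may differ (assumption).
interface HttpClient
{
    public function get(string $url): string;
}

// Hypothetical transport that serves canned HTML, e.g. for tests.
final class FixtureHttpClient implements HttpClient
{
    /** @param array<string, string> $pages URL => HTML body */
    public function __construct(private array $pages)
    {
    }

    public function get(string $url): string
    {
        if (!array_key_exists($url, $this->pages)) {
            throw new RuntimeException("No fixture for {$url}");
        }

        return $this->pages[$url];
    }
}

// Consumer code depends only on the interface, so any transport fits.
function pageTitle(HttpClient $client, string $url): ?string
{
    if (preg_match('~<title>(.*?)</title>~is', $client->get($url), $m) === 1) {
        return trim($m[1]);
    }

    return null;
}

$client = new FixtureHttpClient([
    'https://example.test/list' => '<html><title> Listings </title></html>',
]);

echo pageTitle($client, 'https://example.test/list'), PHP_EOL; // prints "Listings"

// In a Laravel app the swap would be a single container binding, roughly:
//   $this->app->bind(HttpClient::class, fn () => new FixtureHttpClient([...]));
// placed in AppServiceProvider::register().
```

The consumer never names a concrete transport, which is why a single container binding is enough to change how pages are fetched.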
Source layout (DataHelm\Crawler\)
| Namespace | Responsibility |
|---|---|
| Http\ | Transports: GuzzleHttpClient, BrowserHttpClient, FlareSolverrHttpClient, BrowserlessHttpClient, ScrapingApiHttpClient, AutoHttpClient, CachedHttpClient, TransportFactory |
| Detection\ | Field detectors (Title, Price, Image, Link, Rating, Address, Description, Gallery, Labeled), ListDetector, PaginationDetector, SpaDetector, BotProtectionDetector, BlueprintGenerator, BlueprintValidator |
| Blueprint\ | The blueprint value objects: ScrapeBlueprint, FieldSelector, CrawlConfig, HttpConfig, OutputConfig, DedupConfig, FiltersConfig, CacheConfig, AutoThrottleConfig, InfiniteScrollConfig, ApiConfig, SearchFilter, BlueprintBuilder |
| Scraping\ | The engine: CrawlEngine, ApiCrawler, ItemExtractor, JsonItemExtractor, Paginator, CrawlState, CrawlStats, ItemSink, ScrapedItem |
| Pipeline\ | ItemPipeline + processors: TrimProcessor, AbsoluteUrlProcessor, SchemaCoercionProcessor |
| Output\ | Exporters (Json, Jsonl, Csv) and sinks (File, JsonFile, Database, Queue, Webhook, Callback), StreamWriter |
| Media\ | ImageStore (download to a disk), ItemImageResolver (pick the primary image) |
| Console\ | The Artisan commands: RunScrapCommand, GenerateBlueprintCommand, ValidateBlueprintCommand, ShellCommand |
| Scaffolding\ | RobotCommandScaffolder, which generates Robot{Name} command files |
How a crawl flows
URL ─► HttpClient ─► HTML/JSON ─► ItemExtractor ─► ItemPipeline ─► OutputSink ─► json/jsonl/csv
       (transport)                (selectors /     (trim, abs-url, (file / db /
                                   dot-paths)       coercion)       callback)
           ▲                          │
           └──── Paginator ───────────┘  (next page until max_pages / limit)
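
The loop in the diagram can be sketched in plain PHP roughly as follows. Every name here (runCrawl(), the closures, the page array shape) is hypothetical and only stands in for HttpClient, ItemExtractor, ItemPipeline, Paginator and OutputSink; the engine's real classes and signatures are not shown in this section.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical crawl loop: fetch, extract, process, write, then follow the
 * next page until $maxPages is reached or there is no next URL.
 *
 * @param callable(string): string     $fetch      transport stand-in
 * @param callable(string): array      $extract    returns ['items' => [...], 'next' => ?string]
 * @param list<callable(array): array> $processors pipeline stand-in
 * @param callable(array): void        $sink       output stand-in
 * @return int number of items written
 */
function runCrawl(
    string $startUrl,
    callable $fetch,
    callable $extract,
    array $processors,
    callable $sink,
    int $maxPages
): int {
    $url = $startUrl;
    $written = 0;

    for ($page = 1; $url !== null && $page <= $maxPages; $page++) {
        $result = $extract($fetch($url));

        foreach ($result['items'] as $item) {
            // Each processor receives the item and returns the updated item.
            foreach ($processors as $process) {
                $item = $process($item);
            }
            $sink($item);
            $written++;
        }

        // Paginator stand-in: stop when the page reports no successor.
        $url = $result['next'] ?? null;
    }

    return $written;
}

// Two fake pages; the "body" is just a lookup key to keep the sketch short.
$pages = [
    'p1' => ['items' => [['title' => '  Lot A ']], 'next' => 'p2'],
    'p2' => ['items' => [['title' => 'Lot B  ']], 'next' => null],
];

$out = [];
$count = runCrawl(
    'p1',
    fn (string $url): string => $url,
    fn (string $body): array => $pages[$body],
    [fn (array $item): array => array_map('trim', $item)],
    function (array $item) use (&$out): void {
        $out[] = $item;
    },
    10
);

echo $count, PHP_EOL; // prints 2
```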
- Static-HTML sites work today. Price detection defaults to R$ (Brazilian Real) and is configurable.
- JS/AJAX sites backed by a JSON API (e.g. Copart) are handled by API mode (mode: "api"): the engine calls the endpoint directly via ApiCrawler and GuzzleHttpClient::request(), extracting fields by JSON dot-path. See API mode.
- Infinite-scroll listings that return HTML fragments (e.g. Portal Zuk) are handled by the infinite_scroll pagination strategy: the engine POSTs an incrementing offset (plus a CSRF token scraped from page one, reusing the session cookie) and parses each fragment with the same item_selector. See Infinite scroll.
- Fully client-rendered sites with no usable API would need a headless transport: add a new HttpClient implementation (e.g. Playwright / Symfony Panther) and bind it in AppServiceProvider, with no other class changes.
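
Extraction "by JSON dot-path", as mentioned for API mode, can be pictured with a small helper like the one below. dotGet() is a hypothetical function written for this example and the sample payload is invented; how the package itself resolves paths (wildcards, escaping, defaults) is not specified here.

```php
<?php
declare(strict_types=1);

/**
 * Resolve a dot-separated path such as "data.results.0.price" against a
 * decoded JSON array. Returns $default when any segment is missing.
 */
function dotGet(array $data, string $path, mixed $default = null): mixed
{
    $current = $data;

    foreach (explode('.', $path) as $segment) {
        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return $default;
        }
        $current = $current[$segment];
    }

    return $current;
}

// Invented payload, decoded to nested PHP arrays.
$payload = json_decode(
    '{"data":{"results":[{"lot":"A1","price":{"amount":1500}}]}}',
    true,
    512,
    JSON_THROW_ON_ERROR
);

var_dump(dotGet($payload, 'data.results.0.price.amount')); // int(1500)
var_dump(dotGet($payload, 'data.results.1.lot', 'n/a'));   // string(3) "n/a"
```

Numeric segments such as `0` address list positions, so one path syntax covers both objects and arrays in the response.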
Extending detection
Field detectors are configured in config/crawler.php under the detectors key. Add your own detector class to teach the generator a domain-specific field; that field will then be suggested in generated blueprints alongside the built-ins.
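
To give a feel for what a domain-specific detector might look like, here is a simplified sketch. The FieldDetector interface, its field() and detect() methods, and MileageDetector are all assumptions made for illustration; the package's real detector contract is not shown in this section and may differ, for example by returning a selector rather than a value.

```php
<?php
declare(strict_types=1);

// Hypothetical contract; not the package's actual detector interface.
interface FieldDetector
{
    /** Name of the field this detector proposes. */
    public function field(): string;

    /** Return a value found in the HTML fragment, or null if none matches. */
    public function detect(string $html): ?string;
}

// Domain-specific example: pull a mileage figure such as "45.000 km".
final class MileageDetector implements FieldDetector
{
    public function field(): string
    {
        return 'mileage';
    }

    public function detect(string $html): ?string
    {
        $text = strip_tags($html);

        // Digits with optional thousands separators, followed by "km".
        if (preg_match('/(\d{1,3}(?:[.,]\d{3})*|\d+)\s*km\b/i', $text, $m) === 1) {
            return $m[1] . ' km';
        }

        return null;
    }
}

$detector = new MileageDetector();
echo $detector->detect('<li><span>45.000</span> km</li>'), PHP_EOL; // prints "45.000 km"
```

Registering such a class would then happen under the detectors key in config/crawler.php; the exact array shape expected there is not documented in this section.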

