Architecture

The crawler is built from small, single-responsibility pieces behind interfaces. They are wired in the package's service provider (`CrawlerServiceProvider`) and can be overridden in your app's `AppServiceProvider`. The four seams are transport, detection, processing pipeline, and output.

The four seams

| Seam | Interface | Role |
|---|---|---|
| Transport | `HttpClient` | Fetches a URL: `guzzle`, `browser`, `flaresolverr`, `scraping_api`, `auto`, plus caching/auto wrappers |
| Detection | `*Detector` | Heuristics that find the list, pagination and each field type |
| Pipeline | `ItemProcessor` | Post-processes each scraped item (trim, absolute URLs, schema coercion) |
| Output | `ItemExporter` / `OutputSink` | Serializes (json/jsonl/csv) and writes (file, database, queue, webhook, callback) |

Because each seam is an interface, swapping behavior (for example, adding a new headless transport) means binding one class, with no other code changes.
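The idea of swapping one class behind a seam can be sketched in plain PHP. This is a minimal illustration, not the package's code: the `get()` method, `StaticHttpClient`, `HeadlessHttpClient` and `PageFetcher` are all hypothetical names, and the real `HttpClient` interface may have a different signature.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the package's HttpClient seam; the real
// interface's method names and signature may differ.
interface HttpClient
{
    public function get(string $url): string;
}

// Default transport (stand-in): returns canned HTML instead of doing I/O.
final class StaticHttpClient implements HttpClient
{
    public function get(string $url): string
    {
        return "<html><!-- static fetch of {$url} --></html>";
    }
}

// Replacement transport (hypothetical): where a headless browser would render the page.
final class HeadlessHttpClient implements HttpClient
{
    public function get(string $url): string
    {
        return "<html><!-- rendered fetch of {$url} --></html>";
    }
}

// A consumer that depends only on the interface, never on a concrete class.
final class PageFetcher
{
    public function __construct(private HttpClient $http) {}

    public function fetch(string $url): string
    {
        return $this->http->get($url);
    }
}

// Swapping behavior is a one-line change at the point of wiring.
// In a Laravel app the equivalent wiring would be a container binding in
// AppServiceProvider::register(), e.g.
//   $this->app->bind(HttpClient::class, HeadlessHttpClient::class);
$fetcher = new PageFetcher(new HeadlessHttpClient());
echo $fetcher->fetch('https://example.com/list'), PHP_EOL;
```

`PageFetcher` never names a concrete transport, so replacing `HeadlessHttpClient` with `StaticHttpClient` at the wiring point changes behavior without touching any other class.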

Source layout (`DataHelm\Crawler\`)

| Namespace | Responsibility |
|---|---|
| `Http\` | Transports: `GuzzleHttpClient`, `BrowserHttpClient`, `FlareSolverrHttpClient`, `BrowserlessHttpClient`, `ScrapingApiHttpClient`, `AutoHttpClient`, `CachedHttpClient`, `TransportFactory` |
| `Detection\` | Field detectors (`Title`, `Price`, `Image`, `Link`, `Rating`, `Address`, `Description`, `Gallery`, `Labeled`), `ListDetector`, `PaginationDetector`, `SpaDetector`, `BotProtectionDetector`, `BlueprintGenerator`, `BlueprintValidator` |
| `Blueprint\` | The blueprint value objects: `ScrapeBlueprint`, `FieldSelector`, `CrawlConfig`, `HttpConfig`, `OutputConfig`, `DedupConfig`, `FiltersConfig`, `CacheConfig`, `AutoThrottleConfig`, `InfiniteScrollConfig`, `ApiConfig`, `SearchFilter`, `BlueprintBuilder` |
| `Scraping\` | The engine: `CrawlEngine`, `ApiCrawler`, `ItemExtractor`, `JsonItemExtractor`, `Paginator`, `CrawlState`, `CrawlStats`, `ItemSink`, `ScrapedItem` |
| `Pipeline\` | `ItemPipeline` + processors: `TrimProcessor`, `AbsoluteUrlProcessor`, `SchemaCoercionProcessor` |
| `Output\` | Exporters (`Json`, `Jsonl`, `Csv`) and sinks (`File`, `JsonFile`, `Database`, `Queue`, `Webhook`, `Callback`), `StreamWriter` |
| `Media\` | `ImageStore` (download to a disk), `ItemImageResolver` (pick the primary image) |
| `Console\` | The Artisan commands: `RunScrapCommand`, `GenerateBlueprintCommand`, `ValidateBlueprintCommand`, `ShellCommand` |
| `Scaffolding\` | `RobotCommandScaffolder`: generates `Robot{Name}` command files |

How a crawl flows

 URL ─► HttpClient ─► HTML/JSON ─► ItemExtractor ─► ItemPipeline ─► OutputSink ─► json/jsonl/csv
          (transport)              (selectors /     (trim, abs-url,   (file / db /
                                    dot-paths)       coercion)         callback)
            ▲                          │
            └──── Paginator ───────────┘   (next page until max_pages / limit)
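
The diagram above can be read as a simple loop. The sketch below is a rough approximation, assuming each stage can be modelled as a callable. The `crawl()` function, its parameters and the in-memory fakes are hypothetical and are not the `CrawlEngine` API.

```php
<?php
declare(strict_types=1);

/**
 * Sketch of the crawl loop. All four callables are hypothetical stand-ins
 * for the package's HttpClient, ItemExtractor, ItemPipeline and OutputSink.
 *
 * @param callable(string): string $fetch
 * @param callable(string): array{items: list<array<string, mixed>>, next: ?string} $extract
 * @param callable(array<string, mixed>): array<string, mixed> $process
 * @param callable(array<string, mixed>): void $write
 * @return int number of items written
 */
function crawl(
    string $startUrl,
    callable $fetch,
    callable $extract,
    callable $process,
    callable $write,
    int $maxPages,
    int $limit
): int {
    $url = $startUrl;
    $written = 0;

    for ($page = 1; $url !== null && $page <= $maxPages; $page++) {
        $body = $fetch($url);        // transport
        $result = $extract($body);   // selectors / dot-paths

        foreach ($result['items'] as $item) {
            if ($written >= $limit) {
                return $written;     // item limit reached
            }
            $write($process($item)); // pipeline, then sink
            $written++;
        }

        $url = $result['next'];      // paginator: null ends the crawl
    }

    return $written;
}

// Usage with in-memory fakes: two pages, three items in total.
$pages = [
    'p1' => ['items' => [['title' => ' A '], ['title' => ' B ']], 'next' => 'p2'],
    'p2' => ['items' => [['title' => ' C ']], 'next' => null],
];
$out = [];
$count = crawl(
    'p1',
    fn (string $url): string => $url,                             // "fetch" returns the page key
    fn (string $body): array => $pages[$body],                    // "extract" looks the page up
    fn (array $item): array => array_map('trim', $item),          // "pipeline" trims values
    function (array $item) use (&$out): void { $out[] = $item; }, // "sink" collects items
    maxPages: 10,
    limit: 100,
);
```

The loop stops on whichever comes first: no next page, the page cap, or the item limit, mirroring the `max_pages / limit` note in the diagram.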

Static-HTML sites work today. Price detection defaults to R$ (Brazilian Real) and is configurable.
JS/AJAX sites backed by a JSON API (e.g. Copart) are handled by API mode (`mode: "api"`): the engine calls the endpoint directly via `ApiCrawler` and `GuzzleHttpClient::request()`, extracting fields by JSON dot-path. See API mode.
Infinite-scroll listings that return HTML fragments (e.g. Portal Zuk) are handled by the `infinite_scroll` pagination strategy: the engine POSTs an incrementing offset (plus a CSRF token scraped from page one, reusing the session cookie) and parses each fragment with the same `item_selector`. See Infinite scroll.
Fully client-rendered sites with no usable API would need a headless transport: add a new `HttpClient` implementation (e.g. Playwright / Symfony Panther) and bind it in `AppServiceProvider`; no other class needs to change.
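
For the API-mode case above, extracting fields by JSON dot-path can be sketched as follows. This is an illustrative helper only: `dotGet()`, the sample payload and the field map are hypothetical and do not reflect how `JsonItemExtractor` is actually implemented.

```php
<?php
declare(strict_types=1);

/**
 * Resolve a dot-path such as "data.results.0.price" against decoded JSON.
 * Hypothetical helper for illustration; not the package's JsonItemExtractor.
 */
function dotGet(array $data, string $path, mixed $default = null): mixed
{
    $current = $data;
    foreach (explode('.', $path) as $segment) {
        // Stop as soon as a segment cannot be followed.
        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return $default;
        }
        $current = $current[$segment];
    }
    return $current;
}

// Made-up API response: a list of rows nested under data.results.
$json = '{"data":{"results":[{"lot":"123","price":{"amount":4500}}]}}';
$payload = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

// Field name => dot-path relative to each row (assumed blueprint shape).
$fields = ['lot' => 'lot', 'price' => 'price.amount'];

$items = [];
foreach (dotGet($payload, 'data.results', []) as $row) {
    $item = [];
    foreach ($fields as $name => $path) {
        $item[$name] = dotGet($row, $path);
    }
    $items[] = $item;
}
```

Returning a default for a missing segment, rather than throwing, lets one absent field leave the rest of the item intact.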

Extending detection

Field detectors are configured in `config/crawler.php` under the `detectors` key. Add your own detector class to teach the generator domain-specific fields; those fields are then suggested in generated blueprints alongside the built-ins.
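
As a rough illustration, registering a custom detector might look like the fragment below. The shape of the `detectors` entry (a flat list of class names) and the `App\Crawler\Detection\MileageDetector` class are assumptions made for this sketch; check the published `config/crawler.php` for the actual structure and the interface a detector must implement.

```php
<?php
// config/crawler.php (fragment). ASSUMPTION: "detectors" is a flat list of
// class names; the real file may use a keyed or nested structure instead.
return [
    'detectors' => [
        // Keep the built-in detectors from the published config here.

        // Hypothetical custom detector for a domain-specific field.
        App\Crawler\Detection\MileageDetector::class,
    ],
];
```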
