Architecture

The crawler is built from small, single-responsibility pieces behind interfaces, wired in the package's service provider (CrawlerServiceProvider) — and overridable in your app's AppServiceProvider. The four seams are transport, detection, processing pipeline, and output.

The four seams

| Seam | Interface | Role |
| --- | --- | --- |
| Transport | `HttpClient` | Fetches a URL: guzzle, browser, flaresolverr, scraping_api, auto, plus caching/auto wrappers |
| Detection | `*Detector` | Heuristics that find the list, pagination and each field type |
| Pipeline | `ItemProcessor` | Post-processes each scraped item (trim, absolute URLs, schema coercion) |
| Output | `ItemExporter` / `OutputSink` | Serializes (json/jsonl/csv) and writes (file, database, queue, webhook, callback) |

Because each is an interface, swapping behavior — e.g. a new headless transport — means binding one class, with no other code changes.
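
To make the "bind one class" idea concrete, here is a minimal, framework-free sketch. Everything in it is an assumption made for illustration: the `HttpClient` interface below is a stand-in with a made-up `get()` method (the package's real signature may differ), `StaticHttpClient`, `HeadlessHttpClient`, `Fetcher` and `make_instance()` are hypothetical, and the `$bindings` array only imitates what the Laravel container does.

```php
<?php
declare(strict_types=1);

// Stand-in for the transport seam. The package's real HttpClient
// interface may expose different methods; get() is an assumption.
interface HttpClient
{
    public function get(string $url): string;
}

// Hypothetical default transport.
final class StaticHttpClient implements HttpClient
{
    public function get(string $url): string
    {
        return "<html>static:{$url}</html>";
    }
}

// Hypothetical replacement transport (e.g. a headless browser).
final class HeadlessHttpClient implements HttpClient
{
    public function get(string $url): string
    {
        return "<html>rendered:{$url}</html>";
    }
}

// The consumer depends only on the interface, never on a concrete class.
final class Fetcher
{
    public function __construct(private HttpClient $client) {}

    public function fetch(string $url): string
    {
        return $this->client->get($url);
    }
}

// Tiny imitation of a container lookup: interface name => concrete class name.
function make_instance(array $bindings, string $abstract): object
{
    $concrete = $bindings[$abstract];

    return new $concrete();
}

$bindings = [HttpClient::class => StaticHttpClient::class];
echo (new Fetcher(make_instance($bindings, HttpClient::class)))->fetch('https://example.com'), PHP_EOL;

// Swapping the transport is a one-line change; Fetcher is untouched.
$bindings[HttpClient::class] = HeadlessHttpClient::class;
echo (new Fetcher(make_instance($bindings, HttpClient::class)))->fetch('https://example.com'), PHP_EOL;
```

In a Laravel app the equivalent of the last step would be a single `$this->app->bind(...)` call in `AppServiceProvider::register()`, mapping the package's interface to your class. The interface's fully qualified name is not shown on this page, so check the package source before binding.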

Source layout (DataHelm\Crawler\)

| Namespace | Responsibility |
| --- | --- |
| `Http\` | Transports: GuzzleHttpClient, BrowserHttpClient, FlareSolverrHttpClient, BrowserlessHttpClient, ScrapingApiHttpClient, AutoHttpClient, CachedHttpClient, TransportFactory |
| `Detection\` | Field detectors (Title, Price, Image, Link, Rating, Address, Description, Gallery, Labeled), ListDetector, PaginationDetector, SpaDetector, BotProtectionDetector, BlueprintGenerator, BlueprintValidator |
| `Blueprint\` | The blueprint value objects: ScrapeBlueprint, FieldSelector, CrawlConfig, HttpConfig, OutputConfig, DedupConfig, FiltersConfig, CacheConfig, AutoThrottleConfig, InfiniteScrollConfig, ApiConfig, SearchFilter, BlueprintBuilder |
| `Scraping\` | The engine: CrawlEngine, ApiCrawler, ItemExtractor, JsonItemExtractor, Paginator, CrawlState, CrawlStats, ItemSink, ScrapedItem |
| `Pipeline\` | ItemPipeline + processors: TrimProcessor, AbsoluteUrlProcessor, SchemaCoercionProcessor |
| `Output\` | Exporters (Json, Jsonl, Csv) and sinks (File, JsonFile, Database, Queue, Webhook, Callback), StreamWriter |
| `Media\` | ImageStore (download to a disk), ItemImageResolver (pick the primary image) |
| `Console\` | The Artisan commands: RunScrapCommand, GenerateBlueprintCommand, ValidateBlueprintCommand, ShellCommand |
| `Scaffolding\` | RobotCommandScaffolder: generates Robot{Name} command files |

How a crawl flows

 URL ─► HttpClient ─► HTML/JSON ─► ItemExtractor ─► ItemPipeline ─► OutputSink ─► json/jsonl/csv
          (transport)              (selectors /     (trim, abs-url,   (file / db /
                                    dot-paths)       coercion)         callback)
            ▲                          │
            └──── Paginator ───────────┘   (next page until max_pages / limit)
  • Static-HTML sites work today. Price detection defaults to R$ (Brazilian Real) and is configurable.
  • JS/AJAX sites backed by a JSON API (e.g. Copart) are handled by API mode (mode: "api"): the engine calls the endpoint directly via ApiCrawler and GuzzleHttpClient::request(), extracting fields by JSON dot-path. See API mode.
  • Infinite-scroll listings that return HTML fragments (e.g. Portal Zuk) are handled by the infinite_scroll pagination strategy: the engine POSTs an incrementing offset (plus a CSRF token scraped from page one, reusing the session cookie) and parses each fragment with the same item_selector. See Infinite scroll.
  • Fully client-rendered sites with no usable API would need a headless transport: add a new HttpClient implementation (e.g. Playwright / Symfony Panther) and bind it in AppServiceProvider — no other class changes.
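
The flow above can be sketched as a single loop. This is an illustration under stated assumptions, not the engine's actual code: `crawl()` and `dotGet()` are hypothetical names, each callable stands in for one seam (transport, extractor, processors, sink, paginator), and the fake two-page JSON "site" exists only so the sketch runs without a network.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: read a nested value by dot-path, e.g. "data.items".
function dotGet(array $data, string $path): mixed
{
    $current = $data;
    foreach (explode('.', $path) as $key) {
        if (!is_array($current) || !array_key_exists($key, $current)) {
            return null;
        }
        $current = $current[$key];
    }

    return $current;
}

/**
 * Hypothetical crawl loop: fetch -> extract -> process -> sink -> next page.
 * Stops when there is no next page, or at $maxPages pages, or at $limit items.
 *
 * @param callable(string): string          $fetch      URL -> response body
 * @param callable(string): array           $extract    body -> list of raw items
 * @param array<callable(array): array>     $processors applied to each item in order
 * @param callable(array): void             $sink       writes one processed item
 * @param callable(string, string): ?string $nextPage   (url, body) -> next URL or null
 */
function crawl(
    string $startUrl,
    callable $fetch,
    callable $extract,
    array $processors,
    callable $sink,
    callable $nextPage,
    int $maxPages,
    int $limit
): int {
    $url = $startUrl;
    $written = 0;

    for ($page = 1; $url !== null && $page <= $maxPages; $page++) {
        $body = $fetch($url);
        foreach ($extract($body) as $item) {
            if ($written >= $limit) {
                return $written;
            }
            foreach ($processors as $process) {
                $item = $process($item);
            }
            $sink($item);
            $written++;
        }
        $url = $nextPage($url, $body);
    }

    return $written;
}

// Fake two-page JSON API keyed by URL (illustration only).
$pages = [
    'p1' => json_encode(['data' => ['items' => [['title' => ' A ', 'link' => '/a'], ['title' => 'B', 'link' => '/b']], 'next' => 'p2']]),
    'p2' => json_encode(['data' => ['items' => [['title' => ' C', 'link' => '/c']], 'next' => null]]),
];

$out = [];
$count = crawl(
    'p1',
    fn (string $url): string => $pages[$url],
    fn (string $body): array => dotGet(json_decode($body, true), 'data.items') ?? [],
    [
        fn (array $item): array => array_map('trim', $item),                                  // trim step
        fn (array $item): array => ['link' => 'https://example.com' . $item['link']] + $item, // absolute-URL step
    ],
    function (array $item) use (&$out): void {
        $out[] = $item; // in-memory stand-in for an output sink
    },
    fn (string $url, string $body): ?string => dotGet(json_decode($body, true), 'data.next'),
    maxPages: 10,
    limit: 100,
);

echo $count, ' items written', PHP_EOL;
```

Because every stage is passed in, replacing the fetch callable with a headless transport, or the sink with a database writer, leaves the loop itself unchanged, which is the point of the four seams.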

Extending detection

Field detectors are configurable in config/crawler.php under detectors. Add your own detector class to teach the generator domain-specific fields, and it will be suggested in generated blueprints alongside the built-ins.
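
As a rough idea of what a domain-specific detector could look like, here is a sketch that finds a vehicle-mileage field. The contract is an assumption: the `FieldDetector` interface, its `field()` and `detect()` methods, and the `MileageDetector` class are all hypothetical, and the package's real detector interface may look different. The sketch uses PHP's bundled DOM extension.

```php
<?php
declare(strict_types=1);

// Hypothetical contract; the package's real detector interface may differ.
interface FieldDetector
{
    /** Name of the field this detector proposes. */
    public function field(): string;

    /** Returns a simple tag.class selector for the field inside one list item, or null. */
    public function detect(string $itemHtml): ?string;
}

// Hypothetical detector for values such as "84.000 km".
final class MileageDetector implements FieldDetector
{
    public function field(): string
    {
        return 'mileage';
    }

    public function detect(string $itemHtml): ?string
    {
        if (trim($itemHtml) === '') {
            return null;
        }

        // Parse leniently: real-world HTML is rarely valid, so silence libxml warnings.
        $dom = new DOMDocument();
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($itemHtml);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $xpath = new DOMXPath($dom);

        // Only leaf elements, so the match is the innermost node holding the text.
        foreach ($xpath->query('//*[not(*)]') as $node) {
            if (!preg_match('/^\s*\d[\d.,]*\s*km\s*$/i', $node->textContent)) {
                continue;
            }

            $class = trim($node->getAttribute('class'));

            return $class === ''
                ? $node->nodeName
                : $node->nodeName . '.' . preg_replace('/\s+/', '.', $class);
        }

        return null;
    }
}

$detector = new MileageDetector();
$selector = $detector->detect('<li><h2 class="title">Civic</h2><span class="meta km">84.000 km</span></li>');
echo $detector->field(), ' => ', $selector ?? 'not found', PHP_EOL;
```

Registering such a class would mean listing it under the `detectors` key in `config/crawler.php`. The exact shape of that array is not shown on this page, so follow the entries already present in the published config file.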

Released under the MIT License.