Presets & item pipeline

Two configuration systems that sound similar (both pick a named config by string) but act at opposite ends of the process: presets shape how fields are found during generation; the pipeline shapes what happens to their values on every run.

| | Presets | Pipeline |
|---|---|---|
| Runs during | scrap:generate only, once | scrap:run, every execution |
| Acts on | Detection heuristics: finding the right selectors | Extracted values: transforming them after extraction |
| Selected via | --preset=ecommerce on the CLI | "pipeline_names": [...] in the blueprint JSON |
| Outlives the run? | No; only its effect on the generated blueprint persists | Yes; read from the blueprint on every future run |

Presets (field-detection heuristics)

A preset is a named bundle of heuristics used only by datahelm:scrap:generate — the auto-detection step that guesses selectors on a site it has never seen. It has no effect once a blueprint is saved.

```bash
php artisan datahelm:scrap:generate "https://example.com/listing" --preset=ecommerce
```

Built-in presets live under the presets key of config/crawler.php and are selected with --preset= or the CRAWLER_PRESET env var (default: generic):

| Preset | Use for |
|---|---|
| generic | Unsure / mixed content; safe default for any country, any vertical |
| ecommerce | Online shops, marketplaces (adds handle, product-image, qty, …) |
| auctions | Auction/lot listings |
| properties | Real-estate listings |

Each preset is an array of hints the detectors match against CSS classes, HTML attributes and JSON field names:

| Key | Controls |
|---|---|
| price_patterns | Regexes for currency formats ($, R$, £, …); locale-specific |
| image_field_hints | CSS class / JSON key fragments marking an image field (image, thumb, gallery, …) |
| link_field_hints | Same, for the item URL field (url, href, slug, handle, …) |
| rating_hints | CSS class fragments for star/score widgets |
| stock_hints | CSS class fragments for availability/inventory |
| sku_hints | CSS class / JSON key fragments for product codes (SKU, EAN, MPN, …) |
| image_path_prefix | Fixed URL path segment identifying image URLs on a known platform (e.g. VTEX's /arquivos/); null = auto |
| list_* thresholds | When a repeating block counts as a real item list |
| item_schema | Suggested type-coercion map carried into the generated blueprint |

Most hint lists are CSS-class vocabulary and stay in English regardless of the page's display language — developers write class="star-rating" on French/Portuguese/Arabic sites alike. Only price_patterns and image_path_prefix are locale/platform specific.
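To make the matching idea concrete, here is a minimal sketch of how hint lists like these could be applied. It is not the package's actual detector: the helper names classMatchesHint and looksLikePrice are hypothetical, and the sketch assumes hints are compared as case-insensitive substrings of a class attribute and that price_patterns entries are ordinary PCRE regexes. It needs PHP 8.0+ for str_contains().

```php
<?php
// Hypothetical helper: true when any hint fragment occurs in the class attribute.
function classMatchesHint(string $classAttr, array $hints): bool
{
    $haystack = strtolower($classAttr);
    foreach ($hints as $hint) {
        if (str_contains($haystack, strtolower($hint))) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: true when any price regex matches the text.
function looksLikePrice(string $text, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $text) === 1) {
            return true;
        }
    }
    return false;
}

// Sample vocabulary, modelled on the keys in the table above.
$imageHints    = ['image', 'thumb', 'gallery'];
$pricePatterns = ['/R\$\s*[\d.,]+/', '/\$\s*[\d.,]+/'];

var_dump(classMatchesHint('product-Thumb lazy', $imageHints)); // bool(true)
var_dump(classMatchesHint('star-rating', $imageHints));        // bool(false)
var_dump(looksLikePrice('R$ 1.299,90', $pricePatterns));       // bool(true)
```

Substring matching is why English class fragments carry across locales, while the price regexes have to be written per currency format.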

Adding your own preset

Extend an existing one by merging in local vocabulary, in the published config/crawler.php:

```php
'presets' => [
    // ...
    'auctions_pt_BR' => [
        'price_patterns'    => ['/R\$\s*[\d.,]+/'],
        'image_field_hints' => array_merge(
            ['image', 'img', 'photo', 'thumb'],
            ['foto', 'fotos', 'imagem', 'imagens', 'galeria'],
        ),
        'link_field_hints'  => ['url', 'link', 'href', 'permalink', 'lote'],
        'image_path_prefix' => '/arquivos/',
    ],
],
```
```bash
php artisan datahelm:scrap:generate <url> --preset=auctions_pt_BR
```

Item pipeline

The pipeline runs on every crawl execution, transforming each already-extracted item before it is exported. By default (the pipeline key of config/crawler.php) every item passes through:

  1. TrimProcessor — collapses whitespace and trims every string field.
  2. AbsoluteUrlProcessor — resolves relative link / image / gallery_images / … URLs against the page they were scraped from.
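The effect of these two steps can be approximated with plain functions. This is a simplified sketch, not the source of TrimProcessor or AbsoluteUrlProcessor: collapseWhitespace and toAbsoluteUrl are hypothetical names, and the URL resolver deliberately handles only absolute, protocol-relative, root-relative and simple path-relative inputs (no ../ segments). It needs PHP 8.0+ for str_starts_with().

```php
<?php
// Stand-in for the trimming step: collapse whitespace runs, then trim the ends.
function collapseWhitespace(string $value): string
{
    return trim(preg_replace('/\s+/u', ' ', $value) ?? $value);
}

// Stand-in for URL resolution against the page the item was scraped from.
function toAbsoluteUrl(string $url, string $pageUrl): string
{
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $url) === 1) {
        return $url; // already absolute
    }

    $parts  = parse_url($pageUrl);
    $scheme = $parts['scheme'] ?? 'https';
    $origin = $scheme . '://' . ($parts['host'] ?? '')
        . (isset($parts['port']) ? ':' . $parts['port'] : '');

    if (str_starts_with($url, '//')) {
        return $scheme . ':' . $url; // protocol-relative
    }
    if (str_starts_with($url, '/')) {
        return $origin . $url; // root-relative
    }

    // Path-relative: resolve against the directory part of the page path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    return $origin . $dir . $url;
}

var_dump(collapseWhitespace("  Vintage \n  lamp  "));
// string(12) "Vintage lamp"
var_dump(toAbsoluteUrl('/img/a.jpg', 'https://example.com/listing?page=2'));
// string(29) "https://example.com/img/a.jpg"
```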

Overriding per blueprint: pipeline_names

A blueprint can replace the global pipeline for itself with pipeline_names — a list of short names resolved against config('crawler.pipeline_registry'):

```php
'pipeline_registry' => [
    'trim'            => TrimProcessor::class,
    'absolute_url'    => AbsoluteUrlProcessor::class,
    'schema_coercion' => SchemaCoercionProcessor::class,
],
```
```json
{ "pipeline_names": ["trim", "schema_coercion"] }
```

Replacement, not addition

Listing ["trim"] runs only TrimProcessor for that blueprint — AbsoluteUrlProcessor no longer runs, so relative URLs are left as-is. Leaving pipeline_names empty ([], the default) keeps the global pipeline untouched — most blueprints never need to set this.
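The replacement rule can be pictured as a small resolver. This is a sketch of the behaviour described here, not the package's code: resolvePipeline is a hypothetical function, class names are plain strings to keep the example self-contained, and throwing on an unknown name is an assumption (the source does not say how unregistered names are handled).

```php
<?php
// Hypothetical resolver: an empty list keeps the global pipeline,
// a non-empty list replaces it entirely.
function resolvePipeline(array $pipelineNames, array $registry, array $globalPipeline): array
{
    if ($pipelineNames === []) {
        return $globalPipeline;
    }

    $resolved = [];
    foreach ($pipelineNames as $name) {
        if (!isset($registry[$name])) {
            // Assumption: unknown names are treated as a configuration error.
            throw new InvalidArgumentException("Unknown pipeline name: {$name}");
        }
        $resolved[] = $registry[$name];
    }

    return $resolved;
}

$registry = [
    'trim'            => 'TrimProcessor',
    'absolute_url'    => 'AbsoluteUrlProcessor',
    'schema_coercion' => 'SchemaCoercionProcessor',
];
$global = ['TrimProcessor', 'AbsoluteUrlProcessor'];

echo implode(', ', resolvePipeline([], $registry, $global)), PHP_EOL;
// TrimProcessor, AbsoluteUrlProcessor
echo implode(', ', resolvePipeline(['trim'], $registry, $global)), PHP_EOL;
// TrimProcessor
```

To add a step rather than swap the pipeline, list the default names explicitly followed by the extra one.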

Adding a custom processor

Implement ItemProcessor, register it, then reference it by name:

```php
namespace App\Pipeline;

use DataHelm\Crawler\Pipeline\ItemProcessor;
use DataHelm\Crawler\Scraping\ScrapedItem;

final class StripEmojiProcessor implements ItemProcessor
{
    public function process(ScrapedItem $item, string $pageUrl): ScrapedItem
    {
        $title = $item->get('title');
        if (is_string($title)) {
            // preg_replace() returns null on failure (e.g. invalid UTF-8);
            // fall back to the original title in that case.
            $item->set('title', preg_replace('/[\x{1F300}-\x{1FAFF}]/u', '', $title) ?? $title);
        }

        return $item;
    }
}
```
```php
// config/crawler.php
'pipeline_registry' => [
    // ...
    'strip_emoji' => \App\Pipeline\StripEmojiProcessor::class,
],
```
```json
{ "pipeline_names": ["trim", "absolute_url", "strip_emoji"] }
```
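To see what the regex in StripEmojiProcessor does and does not remove, it can be tried on plain strings outside the crawler. The sample titles below are made up for illustration; the pattern is the one from the processor.

```php
<?php
$pattern = '/[\x{1F300}-\x{1FAFF}]/u';

// U+1F525 (fire) lies inside the range, so it is stripped.
$clean = preg_replace($pattern, '', "Vintage lamp \u{1F525}");
var_dump($clean); // string(13) "Vintage lamp "

// U+2764 (heavy black heart) lies outside the range, so it is kept.
$kept = preg_replace($pattern, '', "I \u{2764} PHP");
var_dump($kept === "I \u{2764} PHP"); // bool(true)
```

The first result keeps its trailing space, so ordering matters: with ["trim", "absolute_url", "strip_emoji"], trimming runs before the emoji is removed. Symbols outside U+1F300 to U+1FAFF need the character class widened.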


Released under the MIT License.