Presets & item pipeline
Two configuration systems that sound similar (both pick a named config by string) but act at opposite ends of the process: presets shape how fields are found during generation; the pipeline shapes what happens to their values on every run.
| | Presets | Pipeline |
|---|---|---|
| Runs during | scrap:generate only, once | scrap:run, every execution |
| Acts on | Detection heuristics — finding the right selectors | Extracted values — transforming them after extraction |
| Selected via | --preset=ecommerce on the CLI | "pipeline_names": [...] in the blueprint JSON |
| Outlives the run? | No — only its effect on the generated blueprint persists | Yes — read from the blueprint on every future run |
Presets (field-detection heuristics)
A preset is a named bundle of heuristics used only by datahelm:scrap:generate — the auto-detection step that guesses selectors on a site it has never seen. It has no effect once a blueprint is saved.
php artisan datahelm:scrap:generate "https://example.com/listing" --preset=ecommerce
Built-in presets (config/crawler.php → presets), selected with --preset= or the CRAWLER_PRESET env var (default generic):
| Preset | Use for |
|---|---|
| generic | Unsure / mixed content: safe default for any country, any vertical |
| ecommerce | Online shops, marketplaces (adds handle, product-image, qty, …) |
| auctions | Auction/lot listings |
| properties | Real-estate listings |
Each preset is an array of hints the detectors match against CSS classes, HTML attributes and JSON field names:
| Key | Controls |
|---|---|
| price_patterns | Regexes for currency formats ($, R$, €, £, …); locale-specific |
| image_field_hints | CSS class / JSON key fragments marking an image field (image, thumb, gallery, …) |
| link_field_hints | Same, for the item URL field (url, href, slug, handle, …) |
| rating_hints | CSS class fragments for star/score widgets |
| stock_hints | CSS class fragments for availability/inventory |
| sku_hints | CSS class / JSON key fragments for product codes (SKU, EAN, MPN, …) |
| image_path_prefix | Fixed URL path segment identifying image URLs on a known platform (e.g. VTEX's /arquivos/); null = auto |
| list_* thresholds | When a repeating block counts as a real item list |
| item_schema | Suggested type-coercion map carried into the generated blueprint |
Most hint lists are CSS-class vocabulary and stay in English regardless of the page's display language — developers write class="star-rating" on French/Portuguese/Arabic sites alike. Only price_patterns and image_path_prefix are locale/platform specific.
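To make the idea of "fragment" hints concrete, here is a minimal sketch of how a hint list could be matched against a CSS class or JSON key. The helper name `matchesAnyHint` and the case-insensitive substring rule are assumptions for illustration, not the package's actual detector code. It needs PHP 8.0+ for `str_contains`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: true when any hint fragment occurs in the candidate
 * (a CSS class, attribute value or JSON key), ignoring case.
 *
 * @param list<string> $hints
 */
function matchesAnyHint(string $candidate, array $hints): bool
{
    $candidate = strtolower($candidate);

    foreach ($hints as $hint) {
        // Skip empty hints: an empty needle would match everything.
        if ($hint !== '' && str_contains($candidate, strtolower($hint))) {
            return true;
        }
    }

    return false;
}

$imageHints = ['image', 'thumb', 'gallery'];

var_dump(matchesAnyHint('product-thumbnail', $imageHints)); // bool(true)
var_dump(matchesAnyHint('star-rating', $imageHints));       // bool(false)
```

Under this assumption, a hint such as `thumb` matches `product-thumbnail` regardless of the page's display language, which is why the vocabulary can stay in English.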
Adding your own preset
Extend an existing one by merging in local vocabulary, in the published config/crawler.php:
'presets' => [
    // ...
    'auctions_pt_BR' => [
        'price_patterns' => ['/R\$\s*[\d.,]+/'],
        'image_field_hints' => array_merge(
            ['image', 'img', 'photo', 'thumb'],                   // English class vocabulary
            ['foto', 'fotos', 'imagem', 'imagens', 'galeria'],    // Portuguese additions
        ),
        'link_field_hints' => ['url', 'link', 'href', 'permalink', 'lote'],
        'image_path_prefix' => '/arquivos/',
    ],
],
php artisan datahelm:scrap:generate <url> --preset=auctions_pt_BR
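Before generating with a custom preset, it can help to check the price regex against a few strings from the target site. The snippet below is a standalone sketch that reuses the `auctions_pt_BR` pattern; the sample strings are made up for illustration.

```php
<?php

declare(strict_types=1);

// Same pattern as the auctions_pt_BR preset.
$pattern = '/R\$\s*[\d.,]+/';

// Made-up samples; replace them with text copied from the target site.
$samples = ['Lance atual: R$ 1.234,56', 'R$99', 'US$ 10.00', '1.234,56'];

$matched = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample, $m) === 1) {
        $matched[$sample] = $m[0]; // keep the matched price fragment
    }
}

// Only the first two samples match: the others lack the literal "R$" prefix.
print_r($matched);
```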
Item pipeline
The pipeline runs on every crawl execution, transforming each already-extracted item before it is exported. By default (config/crawler.php → pipeline) every item passes through:
- TrimProcessor: collapses whitespace and trims every string field.
- AbsoluteUrlProcessor: resolves relative link / image / gallery_images / … URLs against the page they were scraped from.
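The effect of the two default processors can be sketched on a plain array. The functions `trimFields` and `absolutizeUrl` below are hypothetical stand-ins, not the package's classes: the real processors work on `ScrapedItem` objects and may handle more cases. The URL resolver covers only the common forms and does not normalize `../` segments. It needs PHP 8.0+ for `str_starts_with`.

```php
<?php

declare(strict_types=1);

/** Hypothetical: collapse runs of whitespace and trim every string field. */
function trimFields(array $item): array
{
    foreach ($item as $key => $value) {
        if (is_string($value)) {
            // preg_replace returns null on invalid UTF-8; fall back to the raw value.
            $item[$key] = trim(preg_replace('/\s+/u', ' ', $value) ?? $value);
        }
    }

    return $item;
}

/** Hypothetical: resolve a URL against the page it was scraped from (common cases only). */
function absolutizeUrl(string $url, string $pageUrl): string
{
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $url) === 1) {
        return $url; // already absolute
    }

    $page = parse_url($pageUrl);
    $origin = $page['scheme'] . '://' . $page['host']
        . (isset($page['port']) ? ':' . $page['port'] : '');

    if (str_starts_with($url, '//')) {
        return $page['scheme'] . ':' . $url; // protocol-relative
    }
    if (str_starts_with($url, '/')) {
        return $origin . $url; // root-relative
    }

    // Relative to the directory part of the page path.
    $path = $page['path'] ?? '/';
    $dir = substr($path, 0, (int) strrpos($path, '/') + 1);

    return $origin . $dir . $url;
}

$item = ['title' => "  Vintage \n  lamp ", 'image' => '/img/lamp.jpg'];
$item = trimFields($item);
$item['image'] = absolutizeUrl($item['image'], 'https://example.com/listing?page=2');
// $item is now ['title' => 'Vintage lamp', 'image' => 'https://example.com/img/lamp.jpg']
```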
Overriding per blueprint: pipeline_names
A blueprint can replace the global pipeline for itself with pipeline_names — a list of short names resolved against config('crawler.pipeline_registry'):
'pipeline_registry' => [
    'trim' => TrimProcessor::class,
    'absolute_url' => AbsoluteUrlProcessor::class,
    'schema_coercion' => SchemaCoercionProcessor::class,
],
{ "pipeline_names": ["trim", "schema_coercion"] }
Replacement, not addition
Listing ["trim"] runs only TrimProcessor for that blueprint — AbsoluteUrlProcessor no longer runs, so relative URLs are left as-is. Leaving pipeline_names empty ([], the default) keeps the global pipeline untouched — most blueprints never need to set this.
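The replacement rule can be pictured with a small resolver. This is a sketch under stated assumptions: `resolvePipeline` is a hypothetical function, class names are shown as plain strings, and throwing on an unknown name is a guess rather than documented behavior.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical resolver: an empty list keeps the global pipeline,
 * a non-empty list replaces it entirely.
 *
 * @param list<string>          $pipelineNames  from the blueprint JSON
 * @param array<string, string> $registry       short name => class name
 * @param list<string>          $globalPipeline class names from config
 * @return list<string> class names to run, in order
 */
function resolvePipeline(array $pipelineNames, array $registry, array $globalPipeline): array
{
    if ($pipelineNames === []) {
        return $globalPipeline;
    }

    $resolved = [];
    foreach ($pipelineNames as $name) {
        if (!isset($registry[$name])) {
            // Assumption: unknown names are an error; the package may behave differently.
            throw new InvalidArgumentException("Unknown pipeline name: {$name}");
        }
        $resolved[] = $registry[$name];
    }

    return $resolved;
}

$registry = [
    'trim' => 'TrimProcessor',
    'absolute_url' => 'AbsoluteUrlProcessor',
    'schema_coercion' => 'SchemaCoercionProcessor',
];
$global = ['TrimProcessor', 'AbsoluteUrlProcessor'];

print_r(resolvePipeline([], $registry, $global));       // global pipeline: both processors
print_r(resolvePipeline(['trim'], $registry, $global)); // TrimProcessor only
```

To add a step rather than replace the defaults, list the default names explicitly alongside the new one, for example `["trim", "absolute_url", "schema_coercion"]`.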
Adding a custom processor
Implement ItemProcessor, register it, then reference it by name:
namespace App\Pipeline;

use DataHelm\Crawler\Pipeline\ItemProcessor;
use DataHelm\Crawler\Scraping\ScrapedItem;

final class StripEmojiProcessor implements ItemProcessor
{
    public function process(ScrapedItem $item, string $pageUrl): ScrapedItem
    {
        $title = $item->get('title');

        if (is_string($title)) {
            // Strip code points in U+1F300..U+1FAFF; keep the original title if the regex fails (null).
            $item->set('title', preg_replace('/[\x{1F300}-\x{1FAFF}]/u', '', $title) ?? $title);
        }

        return $item;
    }
}
// config/crawler.php
'pipeline_registry' => [
    // ...
    'strip_emoji' => \App\Pipeline\StripEmojiProcessor::class,
],
{ "pipeline_names": ["trim", "absolute_url", "strip_emoji"] }
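It is worth checking what the character class in `StripEmojiProcessor` actually covers before relying on it. The standalone sketch below applies the same regex to plain strings; the sample titles are made up.

```php
<?php

declare(strict_types=1);

// Same character class as StripEmojiProcessor.
$pattern = '/[\x{1F300}-\x{1FAFF}]/u';

// U+1F525 (fire) lies inside the range and is removed.
$stripped = preg_replace($pattern, '', "Vintage lamp \u{1F525}");

// U+2600 (sun) lies outside the range and is kept.
$untouched = preg_replace($pattern, '', "Sunny deal \u{2600}");

var_dump($stripped === 'Vintage lamp ');        // bool(true): a trailing space is left behind
var_dump($untouched === "Sunny deal \u{2600}"); // bool(true)
```

Two things follow. Symbols outside U+1F300..U+1FAFF need a wider character class. And because stripping a trailing emoji leaves whitespace behind, listing `strip_emoji` before `trim` may give cleaner titles, assuming processors run in the order they are listed.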

