
Scaffold a robot

A robot is a self-contained Artisan command with the blueprint JSON embedded directly in the file — so it needs no external storage. It is also where per-item logic lives: image downloading, image processing, and persistence (Eloquent, queue, webhook).

Pass --robot to the generator to turn a detected blueprint into one:

bash
php artisan datahelm:scrap:generate \
  https://www.exampleauctions.com/real-estate/apartments --get-detail=true --robot --robot-name=ExampleAuctions
# -> creates app/Console/Commands/RobotsCommand/RobotExampleAuctions.php

php artisan datahelm:robot:exampleauctions --limit=20

The robot name defaults to the host (exampleauctions → RobotExampleauctions); pass --robot-name= for exact casing, and --force to overwrite an existing file. Edit the embedded BLUEPRINT JSON in the generated command to refine selectors.

Anatomy of a robot

The generated handle() method loads the embedded blueprint, builds a CallbackSink that runs once per scraped item, and streams items into it:

php
class RobotExampleMarket extends Command
{
    use ScrapesToConsole;

    /** Any Laravel filesystem disk: 'storage' (local), 'public', 's3', 'gcs', … */
    protected string $imageDisk = 'storage';

    /** Subfolder inside the disk where images for this site will be stored. */
    protected string $imageFolder = 'scrapes/images/www.example-market.com';

    public function handle(CrawlEngine $engine, ImageStore $images): int
    {
        $blueprint = ScrapeBlueprint::fromJson(self::BLUEPRINT);
        $hashNames = $blueprint->hashNames;   // from blueprint JSON

        $sink = new CallbackSink(function (ScrapedItem $item) use ($images, $hashNames): void {
            // "primary_image" is the URL the engine resolved (falls back to "image").
            $imageUrl = $item->get('primary_image') ?? $item->get('image');
            if (is_array($imageUrl)) {
                $imageUrl = $imageUrl[0] ?? null;
            }

            // Download to $this->imageDisk / $this->imageFolder.
            $imagePath = is_string($imageUrl) && $imageUrl !== ''
                ? $images->store($imageUrl, $this->imageDisk, $this->imageFolder, $hashNames)
                : null;

            // Optional: resize/watermark/convert via processImage().
            // Uncomment to also store the full gallery from "gallery_images":
            // foreach ((array) $item->get('gallery_images') as $url) { ... }

            // ... build your record and persist it (Eloquent, queue, webhook, …)
        }, $this->imageFolder);

        $this->crawlToSink($engine, $blueprint, $sink, (int) $this->option('limit'));

        return self::SUCCESS;
    }
}
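
The callback-per-item pattern above can be tried in isolation. The following is a minimal sketch in plain PHP, not the package's implementation: DemoCallbackSink, streamToSink() and pickImageUrl() are hypothetical stand-ins, items are plain arrays rather than ScrapedItem objects, and treating a limit of 0 as "no limit" is an assumption.

```php
<?php

declare(strict_types=1);

/** Hypothetical stand-in for a per-item sink (NOT the package's CallbackSink). */
final class DemoCallbackSink
{
    public function __construct(private \Closure $callback)
    {
    }

    /** @param array<string, mixed> $item */
    public function write(array $item): void
    {
        ($this->callback)($item);
    }
}

/**
 * Stream items into the sink, stopping after $limit items.
 * Assumption for this sketch: a limit of 0 means "no limit".
 *
 * @param iterable<array<string, mixed>> $items
 */
function streamToSink(iterable $items, DemoCallbackSink $sink, int $limit = 0): int
{
    $count = 0;
    foreach ($items as $item) {
        if ($limit > 0 && $count >= $limit) {
            break;
        }
        $sink->write($item);
        $count++;
    }

    return $count;
}

/** Prefer "primary_image", fall back to "image", and unwrap arrays to their first entry. */
function pickImageUrl(array $item): ?string
{
    $url = $item['primary_image'] ?? $item['image'] ?? null;
    if (is_array($url)) {
        $url = $url[0] ?? null;
    }

    return is_string($url) && $url !== '' ? $url : null;
}

$seen = [];
$sink = new DemoCallbackSink(function (array $item) use (&$seen): void {
    // In a real robot this is where the image download and persistence would happen.
    $seen[] = pickImageUrl($item);
});

$items = [
    ['title' => 'A', 'primary_image' => 'https://example.com/a.jpg'],
    ['title' => 'B', 'image' => ['https://example.com/b1.jpg', 'https://example.com/b2.jpg']],
    ['title' => 'C'],
    ['title' => 'D', 'image' => 'https://example.com/d.jpg'],
];

$processed = streamToSink($items, $sink, 3); // the fourth item is never passed to the callback
```

Because the sink only receives a closure, the crawl loop stays generic and all site-specific work lives in the callback, which is the same split the generated handle() method uses.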

The two lines you usually change

| Property | Purpose |
| --- | --- |
| $imageDisk | Any Laravel filesystem disk: 'storage', 'public', 's3', 'gcs', … |
| $imageFolder | Subfolder inside that disk where this site's images go |

Cloud disks just need their Flysystem adapter installed and configured in config/filesystems.php. See RobotExampleMarket in the reference project for a complete worked example.
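
As an illustration of the cloud-disk setup, here is a sketch of an S3 disk entry in config/filesystems.php. It mirrors the stock Laravel S3 disk definition and assumes the league/flysystem-aws-s3-v3 adapter has been installed via Composer; the environment variable names are Laravel's defaults, and your project may differ.

```php
<?php

// config/filesystems.php (excerpt): only the "disks" key is shown here.
return [
    'disks' => [
        's3' => [
            'driver' => 's3',
            'key' => env('AWS_ACCESS_KEY_ID'),
            'secret' => env('AWS_SECRET_ACCESS_KEY'),
            'region' => env('AWS_DEFAULT_REGION'),
            'bucket' => env('AWS_BUCKET'),
            'url' => env('AWS_URL'),
            'endpoint' => env('AWS_ENDPOINT'),
            'use_path_style_endpoint' => env('AWS_USE_PATH_STYLE_ENDPOINT', false),
            'throw' => false,
        ],
    ],
];
```

With a disk like this configured, pointing a robot at it should only require changing the property to protected string $imageDisk = 's3';.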

Per-item persistence

Inside the CallbackSink closure you have the full ScrapedItem. This is where you:

  • download images ($images->store(...)) — see Images;
  • run processImage() to resize / watermark / convert;
  • build your record and persist it however you like — Eloquent model, dispatched job, webhook POST, etc.

Because it is plain PHP inside a Laravel command, anything your app can do, a robot can do per item.
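
One way to keep the closure readable is to turn the scraped fields into a flat record first and persist that. The sketch below is plain PHP under stated assumptions: buildRecord() and the field names (url, title, price) are hypothetical, and the Listing model, StoreListing job, and webhook URL in the comments are placeholders for whatever your app uses.

```php
<?php

declare(strict_types=1);

/**
 * Build a flat record from scraped fields.
 * The field names (url, title, price) are hypothetical; use the ones your blueprint defines.
 *
 * @param array<string, mixed> $fields
 * @return array<string, mixed>
 */
function buildRecord(array $fields, ?string $imagePath): array
{
    $price = $fields['price'] ?? null;

    return [
        'source_url' => (string) ($fields['url'] ?? ''),
        'title' => trim((string) ($fields['title'] ?? '')),
        'price' => is_numeric($price) ? (float) $price : null,
        'image_path' => $imagePath,
    ];
}

$record = buildRecord(
    [
        'url' => 'https://www.exampleauctions.com/item/1',
        'title' => '  Two-room apartment ',
        'price' => '125000',
    ],
    'scrapes/images/www.exampleauctions.com/abc.jpg'
);

// Inside a robot's closure, the record could then be persisted in any of these ways
// (Listing, StoreListing and the webhook URL are placeholders, not part of the package):
//   Listing::updateOrCreate(['source_url' => $record['source_url']], $record);
//   StoreListing::dispatch($record);
//   Http::post('https://hooks.example.test/listings', $record);
```

Keying updateOrCreate() on the source URL is a design choice that makes repeated runs update existing rows instead of inserting duplicates.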

Run options

Every generated robot supports the same run-time flags as datahelm:scrap:run:

bash
php artisan datahelm:robot:exampleauctions --limit=20                       # cap items
php artisan datahelm:robot:exampleauctions --output=storage/app/out.json    # custom path
php artisan datahelm:robot:exampleauctions --output=- > out.json            # stdout


Released under the MIT License.