Images

There are two distinct stages: getting image URLs into each item's JSON (controlled by blueprint flags during detection/extraction), and downloading + processing those images (done in your robot's PHP). They are deliberately separate.

Getting image URLs (get-* flags)

These flags decide which image URLs are written into each item's JSON. They do not download anything — the engine just gets the URLs and puts them in the item; the actual saving is your robot's job. Three independent switches:

| Flag | Blueprint key | What lands in the item |
| --- | --- | --- |
| --get-primary-image=true | get_primary_image | A single primary_image URL: the most relevant photo per item |
| --get-all-images=true | get_all_images | Every image URL in all_images, plus primary_image and gallery_images when detail is scraped |
| --get-gallery-images=true | get_gallery_images | The detail-page gallery as the gallery_images array (implies --get-detail) |

Primary image

Whenever get_primary_image or get_all_images is on, every item gets a primary_image field — the single most relevant URL, chosen by:

  1. The image field (the thumbnail from the list card) — preferred.
  2. The first real photo in the gallery_images array (detail-page gallery) — fallback when the list shows no image.

Icon/badge URLs are down-scored, so a real photo wins over a small badge even when the badge appears first. primary_image always holds the URL; the stored path (after you download it) is whatever you choose to record in your callback.
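
The selection order above can be sketched in plain PHP. This is an illustration only, not the engine's code: the function names `pickPrimaryImage` and `looksLikeIcon` are hypothetical, and the keyword check is a crude stand-in for the engine's real down-scoring, which this page does not spell out.

```php
<?php

/**
 * Hypothetical stand-in for the engine's icon/badge down-scoring:
 * treats a URL as an icon when its path contains a telltale keyword.
 */
function looksLikeIcon(string $url): bool
{
    $path = strtolower((string) parse_url($url, PHP_URL_PATH));

    foreach (['icon', 'badge', 'sprite', 'logo'] as $keyword) {
        if (str_contains($path, $keyword)) {
            return true;
        }
    }

    return false;
}

/**
 * Sketch of the documented order: the list-card image first,
 * then the first gallery entry that looks like a real photo.
 *
 * @param array<string, mixed> $item
 */
function pickPrimaryImage(array $item): ?string
{
    $image = $item['image'] ?? null;
    if (is_string($image) && $image !== '') {
        return $image;
    }

    foreach ((array) ($item['gallery_images'] ?? []) as $url) {
        if (is_string($url) && $url !== '' && !looksLikeIcon($url)) {
            return $url;
        }
    }

    return null;
}
```

With this sketch, an item whose gallery starts with a badge URL still resolves to the first photo that follows it, which mirrors the behaviour described above.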

Where images come from

By default, one image is taken from the list row: a single image field with multiple: false. When a record has several photos, set scrape_detail: true and add a detail field named images with multiple: true; it returns an array of every matching image URL from the detail page. Set attribute to whichever attribute holds the URL (src, data-src, …).

Saving images

The blueprint only gets image URLs into the JSON — it never writes files. Downloading happens inside the robot command, in a per-item CallbackSink callback. The generated robot already wires this up. Two protected properties at the top of the class set the target disk and folder — the only two lines you normally need to change:

php
class RobotExampleMarket extends Command
{
    use ScrapesToConsole;

    /** Any Laravel filesystem disk: 'storage' (local), 'public', 's3', 'gcs', … */
    protected string $imageDisk = 'storage';

    /** Subfolder inside the disk where images for this site will be stored. */
    protected string $imageFolder = 'scrapes/images/www.example-market.com';

    public function handle(CrawlEngine $engine, ImageStore $images): int
    {
        $blueprint = ScrapeBlueprint::fromJson(self::BLUEPRINT);
        $hashNames = $blueprint->hashNames;   // from blueprint JSON

        $sink = new CallbackSink(function (ScrapedItem $item) use ($images, $hashNames): void {
            $imageUrl = $item->get('primary_image') ?? $item->get('image');
            if (is_array($imageUrl)) {
                $imageUrl = $imageUrl[0] ?? null;
            }

            $imagePath = is_string($imageUrl) && $imageUrl !== ''
                ? $images->store($imageUrl, $this->imageDisk, $this->imageFolder, $hashNames)
                : null;

            // Uncomment to also store the full gallery from "gallery_images":
            // foreach ((array) $item->get('gallery_images') as $url) { ... }

            // ... build your record and persist it (Eloquent, queue, webhook, …)
        }, $this->imageFolder);

        $this->crawlToSink($engine, $blueprint, $sink, (int) $this->option('limit'));

        return self::SUCCESS;
    }
}

Cloud disks just need their Flysystem adapter installed and configured in config/filesystems.php.
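
The commented-out gallery loop in the callback could be filled in along these lines. This is a minimal sketch, not part of the generated robot: `storeGallery` is a hypothetical helper, and it assumes the storage callback returns the stored path (or null on failure), as `$images->store(...)` appears to in the snippet above.

```php
<?php

/**
 * Hypothetical helper: run every usable gallery URL through a storage callback.
 *
 * @param mixed $gallery Value of the item's "gallery_images" field (may be null).
 * @param callable(string): ?string $store Returns the stored path, or null on failure.
 * @return array<string, string> Map of source URL => stored path.
 */
function storeGallery(mixed $gallery, callable $store): array
{
    $stored = [];

    foreach ((array) $gallery as $url) {
        // Skip empty or non-string entries and URLs that were already stored.
        if (!is_string($url) || $url === '' || isset($stored[$url])) {
            continue;
        }

        $path = $store($url);
        if (is_string($path) && $path !== '') {
            $stored[$url] = $path;
        }
    }

    return $stored;
}
```

Inside the callback, the second argument would wrap the same call used for the primary image, for example `fn (string $url) => $images->store($url, $this->imageDisk, $this->imageFolder, $hashNames)`.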

--hash-names

When set (hash_names: true in the blueprint), each stored image is renamed to a hash of its content on download. This prevents filename collisions and gives content-addressable filenames.
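
To make the idea of a content-addressable filename concrete, here is a rough sketch. It is not the package's implementation: the function name `hashedImageName` is hypothetical, and the choice of SHA-256 and of keeping the source URL's extension are assumptions.

```php
<?php

/**
 * Sketch of a content-addressed filename: a hash of the image bytes plus
 * the extension taken from the source URL. SHA-256 is an assumption here.
 */
function hashedImageName(string $bytes, string $sourceUrl): string
{
    // Use only the URL path so query strings do not leak into the extension.
    $path      = (string) parse_url($sourceUrl, PHP_URL_PATH);
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    $hash      = hash('sha256', $bytes);

    return $extension === '' ? $hash : $hash . '.' . $extension;
}
```

Because the name depends only on the bytes, the same photo served from two different URLs maps to one filename, while two different photos that happen to share a source filename do not overwrite each other.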

Image folder override

Override the default storage path in the blueprint:

json
"image_folder": "scrapes/images/exampleauctions/2026"

Default when null: scrapes/images/{host}/.
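
The fallback rule can be pictured as follows. This is a sketch under assumptions: `resolveImageFolder` is a hypothetical name, and it assumes {host} is the host part of the URL being scraped.

```php
<?php

/**
 * Sketch of the documented fallback: use the blueprint's image_folder when
 * it is set, otherwise scrapes/images/{host}.
 */
function resolveImageFolder(?string $imageFolder, string $siteUrl): string
{
    if ($imageFolder !== null && $imageFolder !== '') {
        return rtrim($imageFolder, '/');
    }

    // Assumption: {host} is the host component of the scraped site's URL.
    $host = (string) parse_url($siteUrl, PHP_URL_HOST);

    return 'scrapes/images/' . $host;
}
```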

Image processing (resize / watermark / convert)

The crawler stores images as-is; processing is application logic, so it lives in your robot's PHP, not in the blueprint JSON. Every generated robot ships with a processImage() hook that runs after each image is downloaded — empty by default. Fill it in with Intervention Image (GD/Imagick, both in the Docker image):

bash
composer require intervention/image

php
use Intervention\Image\ImageManager;
use Intervention\Image\Drivers\Gd\Driver;
use Illuminate\Support\Facades\Storage;

protected function processImage(?string $path): void
{
    if ($path === null) {
        return;
    }

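    $disk    = Storage::disk($this->imageDisk);
    $manager = new ImageManager(new Driver());

    // Read the bytes through the disk so this also works on cloud disks (s3, gcs, …),
    // where path() does not point at a local file.
    $image = $manager->read($disk->get($path));

    $image->scaleDown(width: 800);                                       // resize
    // $image->place('storage/app/watermark.png', 'bottom-right', 10, 10); // watermark

    // Re-encode in the format implied by the stored file's extension and overwrite it.
    $disk->put($path, (string) $image->encodeByExtension(pathinfo($path, PATHINFO_EXTENSION)));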
}

The full Intervention API (crop, cover, blur, text, format conversion, …) is available here — far more than a fixed JSON schema could express. The hook is called automatically from the per-item callback in handle().

In API mode

Everything above works the same: the image / images fields just hold URLs pulled from the JSON by dot-path instead of from HTML. See API mode.
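
For readers unfamiliar with dot-paths, the lookup can be illustrated in plain PHP. This is only a sketch: `dotGet` is a hypothetical name, the sample JSON shape is made up for the example, and the engine's own dot-path syntax may support more than this (Laravel's `data_get()` helper covers the same basic case).

```php
<?php

/**
 * Sketch of a dot-path lookup on decoded JSON.
 * Returns null as soon as a segment is missing.
 */
function dotGet(array $data, string $path): mixed
{
    $current = $data;

    foreach (explode('.', $path) as $segment) {
        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return null;
        }
        $current = $current[$segment];
    }

    return $current;
}

// Made-up API record, decoded to nested arrays.
$record = json_decode(
    '{"media":{"cover":{"url":"https://example.com/c.jpg"},"photos":[{"url":"https://example.com/1.jpg"}]}}',
    true
);

$image = dotGet($record, 'media.cover.url');    // a single image URL
$first = dotGet($record, 'media.photos.0.url'); // numeric segments index into lists
```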



Released under the MIT License.