
Crawling multiple categories in one robot

Point the positional URL at the site base and pass a JSON array of category pages with --search-filters. Each entry's URL is resolved against the base (entries may be relative or absolute), and each entry's tag (e.g. category) is stamped onto every item scraped from that page. A single robot crawls all of the pages into one output, and deduplication, the item limit, and result_filters are shared across them.

Field detection runs on the first filter, so that entry must point to a real listing page.

```bash
php artisan datahelm:scrap:generate \
  "https://www.example-fashion.com" \
  --get-detail=true --get-primary-image=true --hash-names=true \
  --robot-name=example-fashion \
  --search-filters='[
    {"url": "mens/knitwear-sweaters/", "category": "knitwear-sweaters-men"},
    {"url": "womens/dresses/",         "category": "dress-women"}
  ]'
```

WARNING

The --search-filters value must be a single-quoted JSON string passed as one logical argument. Use any tag key you like (category, type, …): every key besides url (and the limit control key described under Per-filter limit) is copied onto each item.

Each scraped item then carries the tag:

```json
{ "title": "Grey Sweater", "price": "$ 149.90", "category": "knitwear-sweaters-men" }
{ "title": "Floral Dress", "price": "$ 199.90", "category": "dress-women" }
```
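The resolve-then-tag behaviour described above can be illustrated with a small Python sketch. It is not the tool's implementation: the helper names `resolve_filters` and `stamp` are hypothetical, and the use of standard URL joining is an assumption about how relative entries are resolved against the base.

```python
from urllib.parse import urljoin

# Keys assumed to control the crawl rather than tag items (assumption
# based on this page: url / url_sufix carry the path, limit is a quota).
CONTROL_KEYS = {"url", "url_sufix", "limit"}


def resolve_filters(base_url, search_filters):
    """Return (absolute_url, tags) pairs, one per filter entry."""
    # Ensure the base ends with "/" so relative paths append to it.
    base = base_url if base_url.endswith("/") else base_url + "/"
    resolved = []
    for entry in search_filters:
        tags = {k: v for k, v in entry.items() if k not in CONTROL_KEYS}
        resolved.append((urljoin(base, entry["url"]), tags))
    return resolved


def stamp(item, tags):
    """Copy the filter's tags onto a scraped item (hypothetical helper)."""
    return {**item, **tags}


filters = [
    {"url": "mens/knitwear-sweaters/", "category": "knitwear-sweaters-men"},
    {"url": "womens/dresses/", "category": "dress-women"},
]
pages = resolve_filters("https://www.example-fashion.com", filters)
first_url, first_tags = pages[0]
tagged = stamp({"title": "Grey Sweater", "price": "$ 149.90"}, first_tags)
print(first_url)
print(tagged)
```

Under these assumptions, the first entry resolves to a page under the site base and the scraped item gains the `category` key from its filter.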

In the blueprint

The base stays in url and each filter keeps its relative suffix under url_sufix (resolved against url at crawl time):

```json
"url": "https://www.example-fashion.com",
"search_filters": [
  { "url_sufix": "mens/knitwear-sweaters/", "category": "knitwear-sweaters-men" },
  { "url_sufix": "womens/dresses/", "category": "dress-women" }
]
```

On input you may use url_sufix or url for the path, and a bare string entry ("mens/knitwear-sweaters/") crawls that page without tagging. An absolute suffix (https://…) is used as-is.
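As a rough illustration of these input rules, the sketch below normalizes the three accepted entry shapes into the blueprint form and leaves absolute suffixes untouched. The function names `normalize_filter` and `crawl_url` are hypothetical, the host `other.example` is a placeholder, and letting `url_sufix` win when both keys are present is an assumption, not documented behaviour.

```python
from urllib.parse import urljoin, urlsplit


def normalize_filter(entry):
    """Turn an input entry into a dict keyed by url_sufix."""
    if isinstance(entry, str):
        # A bare string crawls the page without adding any tag.
        return {"url_sufix": entry}
    normalized = dict(entry)
    url = normalized.pop("url", None)
    # Assumption: an explicit url_sufix takes precedence over url.
    if "url_sufix" not in normalized and url is not None:
        normalized["url_sufix"] = url
    return normalized


def crawl_url(base_url, suffix):
    """Resolve a suffix against the base; absolute suffixes are used as-is."""
    if urlsplit(suffix).scheme in ("http", "https"):
        return suffix
    return urljoin(base_url.rstrip("/") + "/", suffix)


base = "https://www.example-fashion.com"
bare = normalize_filter("mens/knitwear-sweaters/")
tagged = normalize_filter({"url": "womens/dresses/", "category": "dress-women"})
print(bare, crawl_url(base, bare["url_sufix"]))
print(tagged, crawl_url(base, tagged["url_sufix"]))
print(crawl_url(base, "https://other.example/sale/"))
```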

API mode

API mode uses its own api.endpoint, so search_filters applies to HTML crawls only.

Per-filter limit

Add a limit to a filter to cap how many items that category contributes (0 or omitted means unlimited). This is the per-category quota that the global --limit cannot provide: --limit is a single total shared across all filters, so without per-filter limits the first category can consume it entirely:

```json
"search_filters": [
  { "url_sufix": "shop/wd/mens",   "category": "mens",   "limit": 40 },
  { "url_sufix": "shop/wd/womens", "category": "womens", "limit": 40 },
  { "url_sufix": "shop/wd/womens-bottoms-wide-leg-jeans-jeans", "category": "jeans", "limit": 40 }
]
```

This yields up to 40 items from each category (at most 120 in total). limit is a control key, not an item tag, so it is not copied onto items. A global --limit still applies on top as an overall cap.
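The interaction between per-filter limits and the global cap can be sketched in Python as follows. This is a simplified model under stated assumptions, not the crawler's code: `collect` is a hypothetical function, filters are assumed to be processed in order, and each category is assumed to have 100 items available.

```python
def collect(search_filters, items_by_filter, global_limit=0):
    """Merge items filter by filter, honouring both kinds of limit.

    A limit of 0 (or a missing limit) means unlimited.
    """
    results = []
    for flt, items in zip(search_filters, items_by_filter):
        per_filter = flt.get("limit", 0)
        taken = 0
        for item in items:
            if global_limit and len(results) >= global_limit:
                return results  # overall cap reached
            if per_filter and taken >= per_filter:
                break  # this category's quota is used up
            results.append(item)
            taken += 1
    return results


names = ["mens", "womens", "jeans"]
# Assumed sample data: 100 items available per category.
available = [[{"category": n, "n": i} for i in range(100)] for n in names]

quota = [{"category": n, "limit": 40} for n in names]
no_quota = [{"category": n} for n in names]

with_quota = collect(quota, available)
capped = collect(quota, available, global_limit=100)
greedy = collect(no_quota, available, global_limit=100)

print(len(with_quota), len(capped), len(greedy))
print({item["category"] for item in greedy})
```

In this model the last call shows the problem that per-filter limits address: with only a global cap of 100, every kept item comes from the first category.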

search_filters vs. result_filters

These are different tools:

| Option | Chooses | Notes |
| --- | --- | --- |
| search_filters | which URLs to crawl | tags items, per-category limits |
| result_filters | which items to keep | see Result filters |

result_filters then apply to the items from all of the search filters.



Released under the MIT License.