Result filters

A result filter is a bouncer for your results: after the crawler extracts each item, result_filters throw away the ones you don't want, so your output JSON only contains items that match your rules.

The default (enabled: false with an empty rules array) means "keep everything".
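Written out as a blueprint fragment, that default would look roughly like this (a sketch assembled from the two values named above, not copied from a shipped blueprint):

```json
"result_filters": {
  "enabled": false,
  "rules": []
}
```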

Backward-compatible name

The blueprint key is result_filters. The older name filters is still accepted, so existing blueprints keep working.
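For illustration, an older blueprint might contain a fragment like the one below. This is a sketch that assumes the legacy `filters` key takes the same `enabled` and `rules` shape as `result_filters`; the page only states that the name is still accepted.

```json
"filters": {
  "enabled": true,
  "rules": [
    { "field": "price", "operator": "not_empty" }
  ]
}
```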

Worked example

Say a robot scrapes these 4 products:

| title | price |
| --- | --- |
| Blue T-Shirt | $ 49.90 |
| Floral Dress | $ 199.90 |
| Long Dress | (empty) |
| Black Cap | $ 29.90 |

You only want dresses that have a price. Turn filters on:

```json
"result_filters": {
  "enabled": true,
  "rules": [
    { "field": "title", "operator": "contains", "value": "Dress" },
    { "field": "price", "operator": "not_empty" }
  ]
}
```

Result — only 1 item is saved:

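| title | price | kept? | why |
| --- | --- | --- | --- |
| Blue T-Shirt | $ 49.90 | no | title has no "Dress" |
| Floral Dress | $ 199.90 | yes | matches both rules |
| Long Dress | (empty) | no | price is empty |
| Black Cap | $ 29.90 | no | title has no "Dress" |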

To use it: open your robot's blueprint, set "enabled": true, and add rules. To turn it off again, set "enabled": false (or remove the rules).

Rule operators

```json
"result_filters": {
  "enabled": true,
  "rules": [
    { "field": "price",   "operator": "not_empty" },
    { "field": "title",   "operator": "contains", "value": "Apartment" },
    { "field": "price",   "operator": "matches",  "value": "/\\$\\s*[\\d.,]+/" },
    { "field": "area_m2", "operator": "gt",       "value": "50" }
  ]
}
```

| Operator | Condition |
| --- | --- |
| `not_empty` | field has a non-empty value |
| `empty` | field is missing or empty |
| `contains` | field value contains `value` |
| `not_contains` | field value does not contain `value` |
| `equals` | field value equals `value` exactly |
| `not_equals` | field value does not equal `value` |
| `matches` | field value matches the regex in `value` |
| `gt` | field value (numeric) is greater than `value` |
| `lt` | field value (numeric) is less than `value` |

All rules are evaluated conjunctively — an item must pass all rules to be kept.
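Because every rule must pass, a pair of rules on the same field can express a range. A minimal sketch, assuming your items carry a numeric `area_m2` field and that numeric values are written as strings, as in the `"50"` shown earlier:

```json
"result_filters": {
  "enabled": true,
  "rules": [
    { "field": "area_m2", "operator": "gt", "value": "50" },
    { "field": "area_m2", "operator": "lt", "value": "120" }
  ]
}
```

Under those assumptions, only items whose `area_m2` is greater than 50 and less than 120 would be kept. The upper bound of 120 is an arbitrary illustrative value.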

Not the same as multiple categories

result_filters decide which items to keep — they do not choose URLs. To crawl several categories of one site with a single robot, use --search-filters; result_filters then apply to the items from all of them.



Released under the MIT License.