JavaScript sites & JSON APIs (API mode)

Many modern sites (e.g. Copart) are JavaScript SPAs: the server returns an empty HTML shell and the listing is fetched by the browser from a JSON API in the background. Scraping the server HTML finds nothing, so the generator falls back to API mode — it calls that JSON endpoint directly and reads fields by dot-path instead of CSS selectors. This is faster and far more reliable than rendering the page in a headless browser, and needs no Chromium.

How detection works

When HTML list detection fails, the generator:

1. Checks for SPA markers (`id="root"`, `__NEXT_DATA__`, `ng-app`, mostly-script pages, …).
2. Scans the HTML and inline scripts for candidate data endpoints (`/api/…`, `/public/…search`, `fetch("…")`, …).
3. Fetches each candidate, auto-detects the items array (the largest list of objects in the JSON), and scaffolds a `type: json` field for every scalar key in the first record.
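
The third step can be pictured with a short sketch. This is not the generator's actual code: `find_items_path` and `scaffold_fields` are hypothetical names, and treating `null` values as scalars is an assumption. It only illustrates the "largest list of objects" heuristic and the one-field-per-scalar-key scaffold described above.

```python
from typing import Any, Dict, List, Tuple


def find_items_path(node: Any, path: str = "") -> Tuple[str, int]:
    """Return (dot_path, length) of the largest list of objects under node."""
    best: Tuple[str, int] = ("", 0)
    if isinstance(node, list):
        # A non-empty list whose elements are all objects is a candidate.
        if node and all(isinstance(x, dict) for x in node):
            best = (path, len(node))
        children = list(enumerate(node))
    elif isinstance(node, dict):
        children = list(node.items())
    else:
        return best  # scalars contain no lists
    for key, child in children:
        child_path = f"{path}.{key}" if path else str(key)
        candidate = find_items_path(child, child_path)
        if candidate[1] > best[1]:
            best = candidate
    return best


def scaffold_fields(record: Dict[str, Any]) -> List[Dict[str, str]]:
    """Build one json field per scalar key; nested objects and lists are skipped."""
    scalars = (str, int, float, bool)
    return [
        {"name": key, "css": key, "type": "json"}
        for key, value in record.items()
        if value is None or isinstance(value, scalars)  # assumption: null counts as scalar
    ]


payload = {
    "meta": {"tags": [{"id": 1}]},
    "data": {"results": {"content": [
        {"lotNumberStr": "123", "makeName": "Fiat", "lotYear": 2019, "extras": {"a": 1}},
        {"lotNumberStr": "124", "makeName": "VW", "lotYear": 2020},
    ]}},
}
items_path, count = find_items_path(payload)
fields = scaffold_fields(payload["data"]["results"]["content"][0])
```

With this sample payload the two-element `content` list wins over the one-element `tags` list, and the nested `extras` object is left out of the scaffold.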

Auto-discovery is best-effort — it works for endpoints reachable with a simple GET. When the endpoint needs a POST body (Copart's does) or is signed, pass it explicitly:

```bash
php artisan datahelm:scrap:generate \
  "https://www.copart.com.br/vehicleFinderSearch/?..." \
  --api-endpoint="https://www.copart.com.br/public/vehicleFinder/search" \
  --api-method=POST \
  --api-items-path="data.results.content" \
  --get-primary-image=true --hash-names=true --robot --robot-name=Copart
```

The generator prints diagnostic notes (SPA detected, endpoint probed, records found, fields scaffolded) so you can see what it decided.

The `api` blueprint block

API-mode blueprints carry "mode": "api" and an api block:

```json
{
  "mode": "api",
  "api": {
    "endpoint": "https://www.copart.com.br/public/vehicleFinder/search",
    "method": "POST",
    "headers": { "Content-Type": "application/json" },
    "body": { "query": ["*:*"], "filter": {}, "sort": ["lot_number asc"] },
    "query": {},
    "items_path": "data.results.content",
    "total_path": "data.results.totalElements",
    "page_param": "page",
    "page_size_param": "size",
    "page_size": 100,
    "start_page": 0,
    "page_in_body": false,
    "detail": {
      "enabled": false,
      "endpoint": "https://www.copart.com.br/public/data/lotdetails/solr/{lot}",
      "method": "GET",
      "items_path": "data.lotDetails"
    }
  },
  "fields": [
    { "name": "lot",   "css": "lotNumberStr", "type": "json" },
    { "name": "title", "css": "makeName",     "type": "json" },
    { "name": "year",  "css": "lotYear",      "type": "json" },
    { "name": "image", "css": "tims",         "type": "json" }
  ]
}
```

| Field | Meaning |
| --- | --- |
| `endpoint` | JSON URL to call |
| `method` | `GET` or `POST` |
| `headers` | Extra request headers (auth, `Content-Type`, …) |
| `body` | Request body for `POST` (associative object) |
| `body_format` | `json` (default) or `form`. `form` sends `application/x-www-form-urlencoded`; nested objects become `key[sub]=…` (DataTables/Copart style). Booleans must be written as the strings `"true"`/`"false"` in form mode |
| `query` | Query-string parameters merged into the URL |
| `items_path` | Dot-path to the array of records (`""` = the root is the list) |
| `total_path` | Optional dot-path to the total count; paging stops when it is reached |
| `page_param` | Query/body key for the page number (`null` = single request, no pagination) |
| `page_size_param` | Query/body key for the page size |
| `page_size` | Records per page |
| `start_page` | First page index (`0` zero-based, `1` one-based) |
| `page_in_body` | Inject `page`/`size` into the JSON body instead of the query string |
| `detail` | Optional per-item second request (see below) |
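
To show how the paging keys fit together, here is a rough sketch of a paging loop. It is an illustration under assumptions, not the engine's implementation: `paginate`, `dig` and the `fetch` callback are hypothetical names, only the query-string case (`page_in_body: false`) is covered, and stopping on an empty page is an assumed safeguard.

```python
from functools import reduce
from typing import Any, Callable, Dict, Iterator, Optional


def dig(doc: Any, path: str) -> Any:
    """Follow a dot-path through nested dicts ('' returns doc itself)."""
    return reduce(lambda node, key: node[key], path.split("."), doc) if path else doc


def paginate(api: Dict[str, Any],
             fetch: Callable[[Dict[str, Any]], Any],
             max_pages: Optional[int] = None) -> Iterator[Dict[str, Any]]:
    """Yield records page by page until the total is reached or a page is empty."""
    page = api.get("start_page", 0)
    seen = 0
    requests_made = 0
    while max_pages is None or requests_made < max_pages:
        params = dict(api.get("query", {}))
        if api.get("page_param"):
            params[api["page_param"]] = page
            if api.get("page_size_param"):
                params[api["page_size_param"]] = api.get("page_size")
        doc = fetch(params)  # fetch stands in for the HTTP call
        requests_made += 1
        items = dig(doc, api.get("items_path", ""))
        if not items:
            break  # assumption: an empty page ends the crawl
        yield from items
        seen += len(items)
        if not api.get("page_param"):
            break  # page_param = null means a single request
        total_path = api.get("total_path")
        if total_path and seen >= dig(doc, total_path):
            break  # total_path reached
        page += 1
```

A fake `fetch` that slices an in-memory list is enough to try it: with five records and `page_size` 2, the loop asks for pages 0, 1 and 2 and then stops because the total has been reached.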

JSON fields

In API mode every field has `"type": "json"` and its `css` holds the dot-path into each record:

- `"makeName"` → top-level key
- `"lotDetails.year"` → nested key
- `"images.0.url"` → first element of a list
- `"multiple": true` returns the whole array at the path (e.g. an image gallery)
- `regex` still post-processes the extracted string

Mixed css/xpath/json types are allowed on the same FieldSelector, but in API mode only json fields resolve (others yield null).
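
The dot-path rules can be sketched in a few lines. The names `resolve` and `extract` are hypothetical, and the regex behaviour shown here (first capture group, or the whole match when there is none) is an assumption rather than documented engine behaviour.

```python
import re
from typing import Any, Dict


def resolve(record: Any, path: str) -> Any:
    """Walk a dot-path; numeric parts index into lists. Returns None when missing."""
    node = record
    for part in path.split("."):
        if isinstance(node, dict):
            node = node.get(part)
        elif isinstance(node, list) and part.isdigit() and int(part) < len(node):
            node = node[int(part)]
        else:
            return None
    return node


def extract(record: Dict[str, Any], field: Dict[str, Any]) -> Any:
    """Apply one field definition to one record."""
    if field.get("type") != "json":
        return None  # in API mode only json fields resolve
    value = resolve(record, field["css"])
    if value is None:
        return None
    if field.get("multiple"):
        return value if isinstance(value, list) else [value]
    if field.get("regex"):
        match = re.search(field["regex"], str(value))
        if match is None:
            return None
        # Assumption: prefer the first capture group when the pattern has one.
        return match.group(1) if match.groups() else match.group(0)
    return value


record = {
    "makeName": "Fiat",
    "lotDetails": {"year": 2019},
    "images": [{"url": "a.jpg"}, {"url": "b.jpg"}],
}
first_image = extract(record, {"name": "image", "css": "images.0.url", "type": "json"})
```

A path that runs past the end of a list or into a missing key resolves to `None` instead of raising, which keeps one malformed record from aborting the crawl in this sketch.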

Per-item detail requests

Set api.detail.enabled = true to fetch a second JSON document per item. The endpoint is a template — {field} placeholders are replaced with the item's already-extracted values (URL-encoded):

```json
"detail": {
  "enabled": true,
  "endpoint": "https://www.copart.com.br/public/data/lotdetails/solr/{lot}",
  "method": "GET",
  "items_path": "data.lotDetails"
}
```

Then add type: json entries to the blueprint's detail_fields[]; their paths are resolved against the object at detail.items_path.
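
As a minimal sketch of the placeholder substitution, assuming placeholders are simple `{name}` tokens: `expand_endpoint` is a hypothetical helper, and raising `KeyError` for a field that was never extracted is a choice made for this example, not documented behaviour.

```python
import re
from typing import Any, Dict
from urllib.parse import quote


def expand_endpoint(template: str, item: Dict[str, Any]) -> str:
    """Replace each {field} with the item's URL-encoded value."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        # safe="" also encodes "/" so a value cannot add path segments.
        return quote(str(item[name]), safe="")
    return re.sub(r"\{(\w+)\}", substitute, template)


url = expand_endpoint(
    "https://www.copart.com.br/public/data/lotdetails/solr/{lot}",
    {"lot": "123 456/A"},
)
```

Here the space and the slash in the made-up lot value are encoded as `%20` and `%2F`, so the value stays inside a single path segment.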

Form-encoded & cookie-protected endpoints (the real Copart case)

Some "APIs" aren't clean JSON-in/JSON-out. Copart's /public/vehicleFinder/search is a DataTables endpoint: a POST with an application/x-www-form-urlencoded body (draw, start, length, filter[MISC], query, page, size) that returns { "draw":…, "recordsTotal":…, "recordsFiltered":…, "data":[…] }. It is also guarded by Imperva/Incapsula anti-bot cookies. To handle this:

- set `api.body_format` to `"form"` (nested objects encode as `filter[MISC]=…`);
- write booleans as strings (`"true"`/`"false"`) so they survive form-encoding;
- set `items_path` to `data` and `total_path` to `recordsFiltered`;
- use `page_in_body: true` with `page_param: page`, `page_size_param: size` (the engine also advances DataTables `start`/`length` automatically when those keys exist in the body);
- paste the session/anti-bot cookies into `http_config.cookies` (`reese84`, `incap_ses_*`, `visid_incap_*`, the member/session ids).
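
The form-encoding and paging points above can be illustrated with a small sketch. Everything here is an assumption made for illustration: `flatten` and `page_body` are hypothetical helpers, the numeric-index convention for lists and the `start = (page - start_page) * size` offset are guesses at reasonable behaviour, and `example_flag` and the `placeholder` value are not real Copart parameters.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import urlencode


def flatten(body: Dict[Any, Any], prefix: str = "") -> List[Tuple[str, str]]:
    """Flatten nested objects into bracketed form keys such as filter[MISC]."""
    pairs: List[Tuple[str, str]] = []
    for key, value in body.items():
        name = f"{prefix}[{key}]" if prefix else str(key)
        if isinstance(value, dict):
            pairs.extend(flatten(value, name))
        elif isinstance(value, list):
            # Assumption: lists use numeric indices, e.g. sort[0]=...
            pairs.extend(flatten(dict(enumerate(value)), name))
        else:
            # str(True) would give "True", hence booleans are written as strings.
            pairs.append((name, str(value)))
    return pairs


def page_body(body: Dict[str, Any], page: int, size: int, start_page: int = 0) -> Dict[str, Any]:
    """Copy the body with page/size injected; move start/length when both exist."""
    paged = dict(body, page=page, size=size)
    if "start" in paged and "length" in paged:
        paged["start"] = (page - start_page) * size  # assumed DataTables offset
        paged["length"] = size
    return paged


body = {
    "draw": 1, "start": 0, "length": 100,
    "filter": {"MISC": "placeholder"},  # placeholder value
    "example_flag": "false",            # hypothetical boolean, written as a string
}
encoded = urlencode(flatten(page_body(body, page=2, size=100)))
```

`urlencode` percent-encodes the brackets (`filter%5BMISC%5D=…`), which a form-decoding server reads back as `filter[MISC]`.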

The fastest way to build such a blueprint is to copy the request from DevTools (Network → the search XHR → Copy as cURL) and translate its URL, headers, body and cookies into the blueprint. The bundled RobotCopart was created this way.

Cookies expire

Anti-bot tokens like reese84 are short-lived. When the robot starts returning HTML/403 instead of JSON (the crawler prints an API: response is not JSON … hint to stderr), re-capture the request and refresh http_config.cookies. For unattended long-term scraping you'd want a headless-browser transport to mint fresh cookies — a separate, heavier add-on.
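
A check along these lines is one way to spot expired cookies early. It is only a sketch: `decode_api_response` is a hypothetical helper, and its hint strings are invented for this example; they are not the crawler's actual messages.

```python
import json
from typing import Any, Optional, Tuple


def decode_api_response(status: int, text: str) -> Tuple[Optional[Any], Optional[str]]:
    """Return (data, None) for a JSON reply, or (None, hint) when it is not usable."""
    if status in (401, 403):
        return None, f"HTTP {status}: cookies have probably expired"
    try:
        return json.loads(text), None
    except json.JSONDecodeError:
        # An anti-bot challenge usually arrives as an HTML page.
        is_html = text.lstrip().lower().startswith(("<!doctype", "<html"))
        kind = "HTML" if is_html else "non-JSON"
        return None, f"{kind} body instead of JSON: re-capture the request"


data, hint = decode_api_response(200, "<html><body>challenge</body></html>")
```

Returning the hint instead of raising lets the caller decide whether to stop the run or skip the page.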

What carries over

Everything else works the same in API mode: get_*_images / hash_names (the image / images fields just hold URLs from the JSON), dedup, conditional result_filters, output_config (json/jsonl/csv + streaming), crawl delays, max_items / --limit, and the crawl-stats summary. max_pages caps how many API pages are requested.

No usable API?

A handful of sites render entirely client-side with no clean endpoint (or a signed one). Those need a headless-browser transport — a new HttpClient backed by Playwright/Panther bound in AppServiceProvider, which would render the page to HTML so the normal CSS/XPath detection applies. That's a heavier, separate addition (Chromium in the image) and isn't needed for API-backed sites like Copart.
