
JavaScript sites & JSON APIs (API mode)

Many modern sites (e.g. Copart) are JavaScript SPAs: the server returns an empty HTML shell and the listing is fetched by the browser from a JSON API in the background. Scraping the server HTML finds nothing, so the generator falls back to API mode — it calls that JSON endpoint directly and reads fields by dot-path instead of CSS selectors. This is faster and far more reliable than rendering the page in a headless browser, and needs no Chromium.

How detection works

When HTML list detection fails, the generator:

  1. Checks for SPA markers (id="root", __NEXT_DATA__, ng-app, mostly-script pages, …).
  2. Scans the HTML/inline scripts for candidate data endpoints (/api/…, /public/…search, fetch("…"), …).
  3. Fetches each candidate, auto-detects the items array (the largest list of objects in the JSON), and scaffolds a type: json field for every scalar key in the first record.
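
Step 3 can be sketched as follows. This is an illustrative Python sketch, not the generator's actual code: the function names are hypothetical, and it assumes that only dict values are descended into and that `null`, object, and list values are skipped when scaffolding fields.

```python
from typing import Any, Iterator, Optional


def iter_object_lists(node: Any, path: str = "") -> Iterator[tuple[str, list]]:
    """Yield (dot_path, list) for every non-empty list of objects in the JSON."""
    if isinstance(node, list):
        if node and all(isinstance(item, dict) for item in node):
            yield path, node
        return  # assumption: do not descend into list elements
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            yield from iter_object_lists(value, child)


def detect_items(doc: Any) -> Optional[tuple[str, list]]:
    """Pick the largest list of objects; its dot-path plays the role of items_path."""
    candidates = list(iter_object_lists(doc))
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: len(candidate[1]))


def scaffold_fields(record: dict) -> list[dict]:
    """Create one type: json field per scalar key of the first record."""
    scalars = (str, int, float, bool)
    return [
        {"name": key, "css": key, "type": "json"}
        for key, value in record.items()
        if isinstance(value, scalars)
    ]
```

For a response shaped like `{"data": {"results": {"content": [...]}}}`, `detect_items` returns the path `data.results.content`, and a root-level list yields the empty path, matching the `""` convention for `items_path`.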

Auto-discovery is best-effort — it works for endpoints reachable with a simple GET. When the endpoint needs a POST body (Copart's does) or is signed, pass it explicitly:

```bash
php artisan datahelm:scrap:generate \
  "https://www.copart.com.br/vehicleFinderSearch/?..." \
  --api-endpoint="https://www.copart.com.br/public/vehicleFinder/search" \
  --api-method=POST \
  --api-items-path="data.results.content" \
  --get-primary-image=true --hash-names=true --robot --robot-name=Copart
```

The generator prints diagnostic notes (SPA detected, endpoint probed, records found, fields scaffolded) so you can see what it decided.

The api blueprint block

API-mode blueprints carry "mode": "api" and an api block:

```json
{
  "mode": "api",
  "api": {
    "endpoint": "https://www.copart.com.br/public/vehicleFinder/search",
    "method": "POST",
    "headers": { "Content-Type": "application/json" },
    "body": { "query": ["*:*"], "filter": {}, "sort": ["lot_number asc"] },
    "query": {},
    "items_path": "data.results.content",
    "total_path": "data.results.totalElements",
    "page_param": "page",
    "page_size_param": "size",
    "page_size": 100,
    "start_page": 0,
    "page_in_body": false,
    "detail": {
      "enabled": false,
      "endpoint": "https://www.copart.com.br/public/data/lotdetails/solr/{lot}",
      "method": "GET",
      "items_path": "data.lotDetails"
    }
  },
  "fields": [
    { "name": "lot",   "css": "lotNumberStr", "type": "json" },
    { "name": "title", "css": "makeName",     "type": "json" },
    { "name": "year",  "css": "lotYear",      "type": "json" },
    { "name": "image", "css": "tims",         "type": "json" }
  ]
}
```

| Field | Meaning |
| --- | --- |
| endpoint | JSON URL to call |
| method | GET or POST |
| headers | Extra request headers (auth, Content-Type, …) |
| body | Request body for POST (associative object) |
| body_format | json (default) or form. form sends application/x-www-form-urlencoded; nested objects become key[sub]=… (DataTables/Copart style). Booleans must be written as the strings "true"/"false" in form mode |
| query | Query-string parameters merged into the URL |
| items_path | Dot-path to the array of records ("" = the root is the list) |
| total_path | Optional dot-path to the total count; paging stops when it is reached |
| page_param | Query/body key for the page number (null = single request, no pagination) |
| page_size_param | Query/body key for the page size |
| page_size | Records per page |
| start_page | First page index (0 zero-based, 1 one-based) |
| page_in_body | Inject page/size into the JSON body instead of the query string |
| detail | Optional per-item second request (see below) |
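
How the paging keys interact can be sketched like this. It is an illustrative Python sketch under assumptions, not the engine's implementation: `paginate`, `dig`, and the `fetch` callable are hypothetical names, and stopping on an empty page is an assumed safety net in addition to the documented `total_path` check.

```python
import copy
from typing import Any, Callable, Iterator, Optional


def dig(doc: Any, path: str) -> Any:
    """Follow a dot-path through nested objects; "" means the document itself."""
    if path == "":
        return doc
    for part in path.split("."):
        if not isinstance(doc, dict):
            return None
        doc = doc.get(part)
    return doc


def paginate(
    api: dict,
    fetch: Callable[[dict, dict], Any],
    max_pages: Optional[int] = None,
) -> Iterator[dict]:
    """Yield records page by page; fetch(query, body) returns the parsed JSON."""
    page = api.get("start_page", 0)
    seen = 0
    pages_done = 0
    while max_pages is None or pages_done < max_pages:
        query = dict(api.get("query") or {})
        body = copy.deepcopy(api.get("body") or {})
        if api.get("page_param") is not None:
            # page_in_body decides whether page/size go into the body or the URL
            target = body if api.get("page_in_body") else query
            target[api["page_param"]] = page
            if api.get("page_size_param"):
                target[api["page_size_param"]] = api.get("page_size")
        doc = fetch(query, body)
        items = dig(doc, api.get("items_path", "")) or []
        yield from items
        seen += len(items)
        pages_done += 1
        if api.get("page_param") is None or not items:
            break  # single request, or an empty page (assumed stop condition)
        total = dig(doc, api["total_path"]) if api.get("total_path") else None
        if total is not None and seen >= total:
            break  # total_path reached
        page += 1
```

With `page_size` 2, `start_page` 0, and a total of 5 records, this sketch issues requests for pages 0, 1, and 2 and then stops.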

JSON fields

In API mode every field has "type": "json" and its css holds the dot-path into each record:

  • "makeName" → top-level key
  • "lotDetails.year" → nested key
  • "images.0.url" → first element of a list
  • "multiple": true returns the whole array at the path (e.g. an image gallery)
  • regex still post-processes the extracted string

Mixed css/xpath/json types are allowed on the same FieldSelector, but in API mode only json fields resolve (others yield null).
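
The path rules above can be sketched in a few lines. This is an illustrative Python sketch, not the library's resolver: `resolve_path` and `extract_field` are hypothetical names, and the regex behaviour (first capture group if present, otherwise the whole match) and the wrapping of a non-list value under `multiple` are assumptions.

```python
import re
from typing import Any


def resolve_path(record: Any, path: str) -> Any:
    """Walk a dot-path; numeric segments index into lists ("images.0.url")."""
    node = record
    for part in path.split(".") if path else []:
        if isinstance(node, dict):
            node = node.get(part)
        elif isinstance(node, list) and part.isdecimal() and int(part) < len(node):
            node = node[int(part)]
        else:
            return None  # missing key, bad index, or a scalar in the way
    return node


def extract_field(record: dict, field: dict) -> Any:
    """Resolve one blueprint field against one record."""
    if field.get("type") != "json":
        return None  # in API mode only json fields resolve
    value = resolve_path(record, field["css"])
    if field.get("multiple"):
        if value is None:
            return []
        return value if isinstance(value, list) else [value]
    pattern = field.get("regex")
    if pattern and value is not None and not isinstance(value, (dict, list)):
        match = re.search(pattern, str(value))
        if match is None:
            return None
        return match.group(1) if match.groups() else match.group(0)
    return value
```

For example, `resolve_path(record, "images.0.url")` returns the `url` of the first element of `images`, and returns `None` when the list is empty or the key is absent.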

Per-item detail requests

Set api.detail.enabled = true to fetch a second JSON document per item. The endpoint is a template — {field} placeholders are replaced with the item's already-extracted values (URL-encoded):

```json
"detail": {
  "enabled": true,
  "endpoint": "https://www.copart.com.br/public/data/lotdetails/solr/{lot}",
  "method": "GET",
  "items_path": "data.lotDetails"
}
```

Then add type: json entries to the blueprint's detail_fields[]; their paths are resolved against the object at detail.items_path.
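
Placeholder expansion for the detail endpoint can be sketched as below. This is an illustrative Python sketch with a hypothetical function name; raising an error for a missing or null value is an assumption, since the behaviour in that case is not described here.

```python
import re
from urllib.parse import quote


def expand_endpoint(template: str, item: dict) -> str:
    """Replace each {field} placeholder with the item's URL-encoded value."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if item.get(name) is None:
            # assumption: fail loudly rather than request a malformed URL
            raise KeyError(f"no extracted value for placeholder {{{name}}}")
        return quote(str(item[name]), safe="")

    return re.sub(r"\{(\w+)\}", substitute, template)
```

With the template above and an item whose `lot` is `"12345"`, the result ends in `/solr/12345`; a value such as `"A/B 1"` would be encoded as `A%2FB%201`.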

Some "APIs" aren't clean JSON-in/JSON-out. Copart's /public/vehicleFinder/search is a DataTables endpoint: a POST with an application/x-www-form-urlencoded body (draw, start, length, filter[MISC], query, page, size) that returns { "draw":…, "recordsTotal":…, "recordsFiltered":…, "data":[…] }. It is also guarded by Imperva/Incapsula anti-bot cookies. To handle this:

  • set api.body_format to "form" (nested objects encode as filter[MISC]=…);
  • write booleans as strings ("true"/"false") so they survive form-encoding;
  • set items_path to data and total_path to recordsFiltered;
  • use page_in_body: true with page_param: page, page_size_param: size (the engine also advances DataTables start/length automatically when those keys exist in the body);
  • paste the session/anti-bot cookies into http_config.cookies (reese84, incap_ses_*, visid_incap_*, the member/session ids).
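
The form encoding and the DataTables offset can be sketched as follows. This is an illustrative Python sketch, not the engine's encoder: the function names are hypothetical, list values encoded as `key[0]=…` are an assumption, the sketch rejects real booleans to mirror the strings-only rule, and the `start` formula is an assumed reading of how `start`/`length` advance.

```python
from typing import Any
from urllib.parse import urlencode


def flatten_form(value: Any, prefix: str = "") -> list[tuple[str, str]]:
    """Flatten nested objects into bracketed keys such as filter[MISC]."""
    pairs: list[tuple[str, str]] = []
    if isinstance(value, dict):
        for key, child in value.items():
            name = f"{prefix}[{key}]" if prefix else str(key)
            pairs.extend(flatten_form(child, name))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            pairs.extend(flatten_form(child, f"{prefix}[{index}]"))
    elif isinstance(value, bool):
        # the blueprint rule: write "true"/"false" strings in form mode
        raise TypeError(f"{prefix}: use the string 'true' or 'false', not a boolean")
    else:
        pairs.append((prefix, "" if value is None else str(value)))
    return pairs


def encode_form(body: dict) -> str:
    """Build an application/x-www-form-urlencoded request body."""
    return urlencode(flatten_form(body))


def advance_datatables(body: dict, page: int, page_size: int, start_page: int = 0) -> dict:
    """Return a copy of body with start/length set for the given page, if present."""
    updated = dict(body)
    if "start" in updated and "length" in updated:
        updated["start"] = (page - start_page) * page_size
        updated["length"] = page_size
    return updated
```

A body containing `{"filter": {"MISC": "x y"}}` encodes to `filter%5BMISC%5D=x+y`, which the server reads back as `filter[MISC]=x y`.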

The fastest way to build such a blueprint is to copy the request from DevTools (Network → the search XHR → Copy as cURL) and translate its URL, headers, body and cookies into the blueprint. The bundled RobotCopart was created this way.

Cookies expire

Anti-bot tokens like reese84 are short-lived. When the robot starts returning HTML/403 instead of JSON (the crawler prints an API: response is not JSON … hint to stderr), re-capture the request and refresh http_config.cookies. For unattended long-term scraping you'd want a headless-browser transport to mint fresh cookies — a separate, heavier add-on.
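
A check for this failure mode might look like the sketch below. It is an illustrative Python sketch with a hypothetical function name and hypothetical messages; it does not reproduce the crawler's own hint, only the idea of telling a blocked response apart from a JSON one.

```python
import json
from typing import Optional


def diagnose_response(status: int, body: str) -> Optional[str]:
    """Return a short reason when the response is not usable JSON, else None."""
    if status in (401, 403):
        return f"HTTP {status}: cookies were rejected or have expired"
    if body.lstrip().startswith("<"):
        return "HTML returned instead of JSON (possibly an anti-bot challenge page)"
    try:
        json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "body is not valid JSON"
    return None
```

Any non-`None` result is a signal to re-capture the request and refresh `http_config.cookies`.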

What carries over

Everything else works the same in API mode: get_*_images / hash_names (the image / images fields just hold URLs from the JSON), dedup, conditional result_filters, output_config (json/jsonl/csv + streaming), crawl delays, max_items / --limit, and the crawl-stats summary. max_pages caps how many API pages are requested.

No usable API?

A handful of sites render entirely client-side with no clean endpoint (or a signed one). Those need a headless-browser transport — a new HttpClient backed by Playwright/Panther bound in AppServiceProvider, which would render the page to HTML so the normal CSS/XPath detection applies. That's a heavier, separate addition (Chromium in the image) and isn't needed for API-backed sites like Copart.



Released under the MIT License.