JavaScript sites & JSON APIs (API mode)
Many modern sites (e.g. Copart) are JavaScript SPAs: the server returns an empty HTML shell and the listing is fetched by the browser from a JSON API in the background. Scraping the server HTML finds nothing, so the generator falls back to API mode — it calls that JSON endpoint directly and reads fields by dot-path instead of CSS selectors. This is faster and far more reliable than rendering the page in a headless browser, and needs no Chromium.
How detection works
When HTML list detection fails, the generator:
- Checks for SPA markers (id="root", __NEXT_DATA__, ng-app, mostly-script pages, …).
- Scans the HTML/inline scripts for candidate data endpoints (/api/…, /public/…search, fetch("…"), …).
- Fetches each candidate, auto-detects the items array (the largest list of objects in the JSON), and scaffolds a type: json field for every scalar key in the first record.
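The last step can be sketched in a few lines. This is a minimal illustration in Python, not the generator's actual code (the generator runs under php artisan); the function names, the choice not to descend into list elements, and the rule that the field name equals the JSON key are all assumptions made for the example.

```python
from typing import Any, Optional


def find_items_path(doc: Any, prefix: str = "") -> Optional[tuple[str, int]]:
    """Return (dot_path, length) of the largest list of objects, or None."""
    if isinstance(doc, list):
        # A non-empty list made only of objects is a candidate; "" means the root.
        if doc and all(isinstance(x, dict) for x in doc):
            return (prefix, len(doc))
        return None  # assumption: lists are not searched any deeper
    best = None
    if isinstance(doc, dict):
        for key, value in doc.items():
            path = f"{prefix}.{key}" if prefix else key
            found = find_items_path(value, path)
            if found and (best is None or found[1] > best[1]):
                best = found
    return best


def scaffold_fields(record: dict) -> list[dict]:
    """Build one type: json field per scalar key of a record."""
    scalars = (str, int, float, bool)
    return [
        {"name": key, "css": key, "type": "json"}  # assumption: name == key
        for key, value in record.items()
        if isinstance(value, scalars)
    ]


# Hypothetical response shaped like the paths used on this page.
response = {
    "data": {"results": {"totalElements": 2, "content": [
        {"lotNumberStr": "123", "lotYear": 2019, "tims": {"full": "a.jpg"}},
        {"lotNumberStr": "124", "lotYear": 2020},
    ]}},
    "facets": [{"name": "make"}],
}
path, count = find_items_path(response)
fields = scaffold_fields(response["data"]["results"]["content"][0])
print(path, count, [f["css"] for f in fields])
```

In this sample the two-element content list wins over the one-element facets list, and tims is skipped because it is an object rather than a scalar.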
Auto-discovery is best-effort — it works for endpoints reachable with a simple GET. When the endpoint needs a POST body (Copart's does) or is signed, pass it explicitly:
php artisan datahelm:scrap:generate \
  "https://www.copart.com.br/vehicleFinderSearch/?..." \
  --api-endpoint="https://www.copart.com.br/public/vehicleFinder/search" \
  --api-method=POST \
  --api-items-path="data.results.content" \
  --get-primary-image=true --hash-names=true --robot --robot-name=Copart
The generator prints diagnostic notes (SPA detected, endpoint probed, records found, fields scaffolded) so you can see what it decided.
The api blueprint block
API-mode blueprints carry "mode": "api" and an api block:
{
  "mode": "api",
  "api": {
    "endpoint": "https://www.copart.com.br/public/vehicleFinder/search",
    "method": "POST",
    "headers": { "Content-Type": "application/json" },
    "body": { "query": ["*:*"], "filter": {}, "sort": ["lot_number asc"] },
    "query": {},
    "items_path": "data.results.content",
    "total_path": "data.results.totalElements",
    "page_param": "page",
    "page_size_param": "size",
    "page_size": 100,
    "start_page": 0,
    "page_in_body": false,
    "detail": {
      "enabled": false,
      "endpoint": "https://www.copart.com.br/public/data/lotdetails/solr/{lot}",
      "method": "GET",
      "items_path": "data.lotDetails"
    }
  },
  "fields": [
    { "name": "lot", "css": "lotNumberStr", "type": "json" },
    { "name": "title", "css": "makeName", "type": "json" },
    { "name": "year", "css": "lotYear", "type": "json" },
    { "name": "image", "css": "tims", "type": "json" }
  ]
}
| Field | Meaning |
|---|---|
| endpoint | JSON URL to call |
| method | GET or POST |
| headers | Extra request headers (auth, Content-Type, …) |
| body | Request body for POST (associative object) |
| body_format | json (default) or form. form sends application/x-www-form-urlencoded; nested objects become key[sub]=… (DataTables/Copart style). Booleans must be written as the strings "true"/"false" in form mode |
| query | Query-string parameters merged into the URL |
| items_path | Dot-path to the array of records ("" = the root is the list) |
| total_path | Optional dot-path to the total count; paging stops when it is reached |
| page_param | Query/body key for the page number (null = single request, no pagination) |
| page_size_param | Query/body key for the page size |
| page_size | Records per page |
| start_page | First page index (0 zero-based, 1 one-based) |
| page_in_body | Inject page/size into the JSON body instead of the query string |
| detail | Optional per-item second request (see below) |
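How the paging keys interact is easiest to see in a loop. The following Python sketch is an assumption about the general shape of such a loop, not the engine's real implementation: paginate, dig and fake_fetch are hypothetical names, stopping on an empty page is an added assumption, and the automatic DataTables start/length handling is left out.

```python
from typing import Any, Callable, Iterator, Optional


def dig(doc: Any, path: str) -> Any:
    """Follow a dot-path through nested dicts; "" returns doc itself."""
    for part in filter(None, path.split(".")):
        if not isinstance(doc, dict) or part not in doc:
            return None
        doc = doc[part]
    return doc


def paginate(api: dict, fetch: Callable[[dict, dict], Any],
             max_pages: Optional[int] = None) -> Iterator[dict]:
    """Yield records page by page, driven by the api block's paging keys."""
    page = api.get("start_page", 0)
    seen = 0
    pages_done = 0
    while max_pages is None or pages_done < max_pages:
        query = dict(api.get("query") or {})
        body = dict(api.get("body") or {})
        if api.get("page_param"):
            # page_in_body decides whether page/size go into the body or the query.
            target = body if api.get("page_in_body") else query
            target[api["page_param"]] = page
            if api.get("page_size_param"):
                target[api["page_size_param"]] = api.get("page_size")
        doc = fetch(query, body)
        pages_done += 1
        items = dig(doc, api.get("items_path", "")) or []
        yield from items
        seen += len(items)
        total = dig(doc, api["total_path"]) if api.get("total_path") else None
        if not api.get("page_param") or not items:
            break  # single request, or (assumption) an empty page ends paging
        if total is not None and seen >= total:
            break  # total_path reached
        page += 1


# Fake transport standing in for the HTTP call: 5 records, 2 per page.
records = [{"id": n} for n in range(5)]
calls = []

def fake_fetch(query: dict, body: dict) -> dict:
    calls.append(query["page"])
    start = query["page"] * query["size"]
    return {"data": {"results": {"content": records[start:start + query["size"]],
                                 "totalElements": len(records)}}}

api = {"items_path": "data.results.content",
       "total_path": "data.results.totalElements",
       "page_param": "page", "page_size_param": "size",
       "page_size": 2, "start_page": 0, "page_in_body": False}
got = list(paginate(api, fake_fetch))
print(len(got), calls)
```

With these settings the loop requests pages 0, 1 and 2 and stops once the running count equals totalElements; passing max_pages=1 would stop after the first request.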
JSON fields
In API mode every field has "type": "json" and its css holds the dot-path into each record:
- "makeName" → top-level key
- "lotDetails.year" → nested key
- "images.0.url" → first element of a list
- "multiple": true returns the whole array at the path (e.g. an image gallery)
- regex still post-processes the extracted string
Mixed css/xpath/json types are allowed on the same FieldSelector, but in API mode only json fields resolve (others yield null).
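The dot-path rules above can be illustrated with a small resolver. This is a Python sketch for illustration only, not the tool's code; resolve_path is a hypothetical name, and returning None for a missing key is an assumption about how unresolved paths behave.

```python
from typing import Any


def resolve_path(record: Any, path: str) -> Any:
    """Walk a dot-path: dict keys by name, list elements by numeric index."""
    if path == "":
        return record
    current = record
    for part in path.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif (isinstance(current, list) and part.isdigit()
              and int(part) < len(current)):
            current = current[int(part)]
        else:
            return None  # assumption: unresolved paths yield null
    return current


# Hypothetical record using the example paths from this section.
record = {
    "makeName": "Honda",
    "lotDetails": {"year": 2019},
    "images": [{"url": "a.jpg"}, {"url": "b.jpg"}],
}
print(resolve_path(record, "makeName"))         # Honda
print(resolve_path(record, "lotDetails.year"))  # 2019
print(resolve_path(record, "images.0.url"))     # a.jpg
print(resolve_path(record, "images"))           # the whole list
```

Resolving "images" without an index returns the whole array, which is roughly what a field with "multiple": true would collect.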
Per-item detail requests
Set api.detail.enabled = true to fetch a second JSON document per item. The endpoint is a template: {field} placeholders are replaced with the item's already-extracted values (URL-encoded):
"detail": {
  "enabled": true,
  "endpoint": "https://www.copart.com.br/public/data/lotdetails/solr/{lot}",
  "method": "GET",
  "items_path": "data.lotDetails"
}
Then add type: json entries to the blueprint's detail_fields[]; their paths are resolved against the object at detail.items_path.
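Placeholder substitution of this kind might look like the following Python sketch. It is an illustration under assumptions, not the tool's implementation: fill_template is a hypothetical name, the exact encoding rules the engine applies are not documented here (percent-encoding of every reserved character is assumed), and leaving unknown placeholders untouched is also an assumption.

```python
import re
from urllib.parse import quote


def fill_template(template: str, item: dict) -> str:
    """Replace {field} placeholders with URL-encoded values from item."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in item:
            return match.group(0)  # assumption: unknown placeholders stay as-is
        return quote(str(item[name]), safe="")
    return re.sub(r"\{(\w+)\}", substitute, template)


template = "https://www.copart.com.br/public/data/lotdetails/solr/{lot}"
print(fill_template(template, {"lot": "12345"}))
print(fill_template(template, {"lot": "12 34/5"}))  # space and slash get encoded
```

Encoding matters because a value containing a slash or a space would otherwise change the path of the detail URL.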
Form-encoded & cookie-protected endpoints (the real Copart case)
Some "APIs" aren't clean JSON-in/JSON-out. Copart's /public/vehicleFinder/search is a DataTables endpoint: a POST with an application/x-www-form-urlencoded body (draw, start, length, filter[MISC], query, page, size) that returns { "draw":…, "recordsTotal":…, "recordsFiltered":…, "data":[…] }. It is also guarded by Imperva/Incapsula anti-bot cookies. To handle this:
- set api.body_format to "form" (nested objects encode as filter[MISC]=…);
- write booleans as strings ("true"/"false") so they survive form-encoding;
- set items_path to data and total_path to recordsFiltered;
- use page_in_body: true with page_param: page, page_size_param: size (the engine also advances DataTables start/length automatically when those keys exist in the body);
- paste the session/anti-bot cookies into http_config.cookies (reese84, incap_ses_*, visid_incap_*, the member/session ids).
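To make the key[sub]=… encoding concrete, here is a Python sketch of how a nested body could be flattened before form-encoding. It is not the engine's code: flatten_form is a hypothetical name, the filter value and the watch key are placeholders invented for the example, and indexing lists as key[0], key[1] is an assumption modelled on PHP's http_build_query.

```python
from urllib.parse import urlencode


def flatten_form(body: dict, parent: str = "") -> list[tuple[str, str]]:
    """Flatten nested objects into (name, value) pairs such as filter[MISC]."""
    pairs = []
    for key, value in body.items():
        name = f"{parent}[{key}]" if parent else str(key)
        if isinstance(value, dict):
            pairs.extend(flatten_form(value, name))
        elif isinstance(value, list):
            # Assumption: lists are indexed like http_build_query (key[0], key[1]).
            pairs.extend(flatten_form(dict(enumerate(value)), name))
        else:
            pairs.append((name, str(value)))
    return pairs


# Placeholder body; booleans are already written as strings, as advised above.
body = {"draw": 1, "start": 0, "length": 100,
        "filter": {"MISC": "placeholder"}, "watch": "true"}
pairs = flatten_form(body)
encoded = urlencode(pairs)
print(pairs)
print(encoded)
```

The brackets are percent-encoded on the wire (filter%5BMISC%5D=placeholder), which the server decodes back to filter[MISC].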
The fastest way to build such a blueprint is to copy the request from DevTools (Network → the search XHR → Copy as cURL) and translate its URL, headers, body and cookies into the blueprint. The bundled RobotCopart was created this way.
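When translating a copied cURL command, the Cookie header arrives as one long string. A small helper like the Python sketch below can split it into name/value pairs; it assumes http_config.cookies accepts a simple name-to-value map, which is not confirmed here, and cookie_header_to_map and the sample values are hypothetical.

```python
def cookie_header_to_map(header: str) -> dict[str, str]:
    """Split a 'Cookie: a=1; b=2' header into a {name: value} map."""
    header = header.strip()
    if header.lower().startswith("cookie:"):
        header = header[len("cookie:"):]
    cookies = {}
    for part in header.split(";"):
        if "=" not in part:
            continue  # skip empty or malformed fragments
        # Split on the first "=" only, since values may contain "=" themselves.
        name, value = part.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


# Made-up values; real tokens come from the captured request.
sample = "Cookie: reese84=abc==; visid_incap_1=xyz; incap_ses_1=q1"
print(cookie_header_to_map(sample))
```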
Cookies expire
Anti-bot tokens like reese84 are short-lived. When the robot starts returning HTML/403 instead of JSON (the crawler prints an API: response is not JSON … hint to stderr), re-capture the request and refresh http_config.cookies. For unattended long-term scraping you'd want a headless-browser transport to mint fresh cookies — a separate, heavier add-on.
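A monitoring script could apply the same test to decide when cookies need refreshing. The Python sketch below is one possible check, not the crawler's own logic; is_json_response is a hypothetical name, and treating every status of 400 or above as a failure is an assumption.

```python
import json


def is_json_response(status: int, body: str) -> bool:
    """Return True when the response looks like a usable JSON document."""
    if status >= 400:
        return False  # e.g. a 403 from the anti-bot layer
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return False  # an HTML challenge page ends up here
    return isinstance(parsed, (dict, list))


print(is_json_response(200, '{"data": []}'))
print(is_json_response(403, '{"data": []}'))
print(is_json_response(200, "<html><body>blocked</body></html>"))
```

A False result on an endpoint that previously worked is a reasonable signal to re-capture the request and update the cookies.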
What carries over
Everything else works the same in API mode: get_*_images / hash_names (the image / images fields just hold URLs from the JSON), dedup, conditional result_filters, output_config (json/jsonl/csv + streaming), crawl delays, max_items / --limit, and the crawl-stats summary. max_pages caps how many API pages are requested.
No usable API?
A handful of sites render entirely client-side with no clean endpoint (or a signed one). Those need a headless-browser transport — a new HttpClient backed by Playwright/Panther bound in AppServiceProvider, which would render the page to HTML so the normal CSS/XPath detection applies. That's a heavier, separate addition (Chromium in the image) and isn't needed for API-backed sites like Copart.

