
Markdown / LLM-ready output

Turn any page into clean Markdown instead of a wall of HTML — the feature Firecrawl and Crawl4AI are known for, now in the Laravel world. The result drops straight into an LLM context window or a RAG index with no HTML noise.

There are two ways to use it, and they pair naturally: extract an article body as a Markdown field, then export the whole crawl as a Markdown document.

1 · The markdown field type

Render one element's content (an article body, a product description — any long-form block) as Markdown by setting the field's type to "markdown". The css selector locates the element; its content is converted:

```json
{
  "name": "description",
  "css": ".product-description",
  "type": "markdown"
}
```

What the converter preserves and strips:

| Preserved | Stripped |
| --- | --- |
| Headings (`#` to `######`), paragraphs, `<br>` | `<script>`, `<style>`, `<noscript>`, `<template>` |
| Nested & ordered lists | Forms, buttons, inputs |
| Links and images | Media embeds (`<iframe>`, `<video>`, `<svg>`, …) |
| Bold / italic / inline code | Site chrome: `<nav>`, `<header>`, `<footer>`, `<aside>` |
| Fenced code blocks with language (`` ```php ``) | |
| Tables (with `\|` escaping) and blockquotes | |

Relative URLs are resolved

Links and images inside the converted content are resolved against the page they were scraped from — [Foo](/wiki/Foo) becomes [Foo](https://example.com/wiki/Foo) — so the Markdown stays valid outside its origin site.
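As a rough illustration of what that resolution involves, here is a minimal PHP sketch. The function name `resolveUrl` is hypothetical and not part of the package, whose own resolver may well differ. The sketch covers absolute, protocol-relative, root-relative, and document-relative links, and deliberately skips `../` normalisation and query-only references.

```php
<?php
declare(strict_types=1);

// Hypothetical helper for illustration only; not the package's resolver.
function resolveUrl(string $href, string $pageUrl): string
{
    // Leave empty links, fragment-only links, and anything with a scheme
    // (https:, mailto:, data:, ...) untouched.
    if ($href === '' || $href[0] === '#' || preg_match('~^[a-z][a-z0-9+.-]*:~i', $href) === 1) {
        return $href;
    }

    $base   = parse_url($pageUrl);
    $scheme = $base['scheme'] ?? 'https';
    $origin = $scheme . '://' . ($base['host'] ?? '')
        . (isset($base['port']) ? ':' . $base['port'] : '');

    // Protocol-relative: reuse the page's scheme.
    if (str_starts_with($href, '//')) {
        return $scheme . ':' . $href;
    }

    // Root-relative: attach to the page's origin.
    if ($href[0] === '/') {
        return $origin . $href;
    }

    // Document-relative: replace the last path segment of the page URL.
    $path = $base['path'] ?? '/';
    $dir  = substr($path, 0, (int) strrpos($path, '/') + 1);

    return $origin . $dir . $href;
}
```

With the page URL `https://example.com/wiki/Bar`, the link `/wiki/Foo` from the example above resolves to `https://example.com/wiki/Foo`, while a link that is already absolute passes through unchanged.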

Set css to an empty string ("") on a detail_fields[] entry to convert the whole detail page context, or point it at the main content container for article-only output.

Note: regex is not applied to markdown fields — it would corrupt multi-line output.

2 · The markdown output format

Export a whole crawl as a single Markdown document, one section per item. Set the blueprint's output_config.format:

```json
"output_config": {
  "format": "markdown"
}
```

```bash
php artisan datahelm:scrap:run example --output=storage/app/scrapes/example.md
```

Each item becomes a section:

  • Heading — the most title-like field (title, name, heading, headline, label), falling back to Item N.
  • Body — the first long-form field (markdown, content, body, description, text), rendered as-is.
  • Metadata — every remaining scalar field as a bullet list (- **price:** 150000).
  • Items are separated by --- horizontal rules.
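The rules above can be sketched in PHP roughly as follows. This is an illustration under stated assumptions, not the package's exporter: `renderItemSection` and `renderDocument` are hypothetical names, "most title-like" and "first long-form field" are both read as "first match in the listed order", sections are assumed to use `##` headings, and non-scalar fields such as lists are simply skipped.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: renders one scraped item as a Markdown section.
function renderItemSection(array $item, int $index): string
{
    $titleKeys = ['title', 'name', 'heading', 'headline', 'label'];
    $bodyKeys  = ['markdown', 'content', 'body', 'description', 'text'];

    // Return the first candidate key that holds a non-empty string.
    $pick = static function (array $keys) use ($item): ?string {
        foreach ($keys as $key) {
            if (isset($item[$key]) && is_string($item[$key]) && $item[$key] !== '') {
                return $key;
            }
        }
        return null;
    };

    $titleKey = $pick($titleKeys);
    $bodyKey  = $pick($bodyKeys);

    // Heading: the title-like field, falling back to "Item N".
    $lines = ['## ' . ($titleKey !== null ? $item[$titleKey] : 'Item ' . $index), ''];

    // Body: the long-form field, emitted as-is.
    if ($bodyKey !== null) {
        $lines[] = $item[$bodyKey];
        $lines[] = '';
    }

    // Metadata: every remaining scalar field as a bullet.
    foreach ($item as $key => $value) {
        if ($key === $titleKey || $key === $bodyKey || !is_scalar($value)) {
            continue;
        }
        $lines[] = sprintf('- **%s:** %s', $key, (string) $value);
    }

    return rtrim(implode("\n", $lines)) . "\n";
}

// Hypothetical helper: joins all sections with --- horizontal rules.
function renderDocument(array $items): string
{
    $sections = [];
    foreach (array_values($items) as $i => $item) {
        $sections[] = renderItemSection($item, $i + 1);
    }
    return implode("\n---\n\n", $sections);
}
```

Calling `renderDocument()` with an item that has `name`, `description`, and `price` keys would use `name` as the heading, `description` as the body, and list `price` as metadata. An item with none of the title-like keys gets the `Item N` heading.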

Example output for a quotes site:

```markdown
## "The world as we have created it is a process of our thinking."

"The world as we have created it…" by Albert Einstein [(about)](https://quotes.toscrape.com/author/Albert-Einstein)

- **author:** Albert Einstein
- **tags:** change, deep-thoughts, thinking, world

---

## "It is our choices, Harry, that show what we truly are…"
```

Using the converter standalone

The engine behind both features, DataHelm\Crawler\Markdown\HtmlToMarkdown, is dependency-free (only ext-dom, bundled with PHP) and works on its own:

```php
use DataHelm\Crawler\Markdown\HtmlToMarkdown;

// Convert an HTML fragment
$markdown = (new HtmlToMarkdown())->convert($html);

// Resolve relative links/images against the page URL
$markdown = (new HtmlToMarkdown())->convert($html, 'https://example.com/article');

// Keep nav/header/footer/aside instead of stripping them
$markdown = (new HtmlToMarkdown(stripChrome: false))->convert($html);

// Convert a live DOM node (e.g. matched by symfony/dom-crawler)
$markdown = (new HtmlToMarkdown())->convertElement($node, $pageUrl);
```


Released under the MIT License.