Tassos Docs

Google Structured Data v6.2.0

Convert HTML to Markdown

Serve clean Joomla-aware Markdown pages with dedicated .md URLs. Help AI Agents like ChatGPT and Claude read, understand and share your valuable content.

HTML to Markdown Conversion is available in Pro

Unlock this and dozens of other powerful features by upgrading to Google Structured Data Pro.

How content is discovered online is changing. Traffic no longer comes only from search engines. AI crawlers and agents are now a significant source, and they operate on a web that was built for humans.

Treating agents as first-class visitors means giving them content they can actually use. When an agent fetches a full HTML page, it also receives the scripts, modules, and template chrome, and every unnecessary token costs more to process. Markdown strips that away, leaving only the content and your JSON-LD structured data.

This guide will help you enable HTML-to-Markdown conversion on your Joomla site.

Convert HTML to Markdown in Joomla

Follow the steps below to enable real-time HTML-to-Markdown conversion on any page of your Joomla site.

Log in to your Joomla administrator
Go to Components → Google Structured Data → Configuration
Open the AI Agents tab
Enable the Convert Pages to Markdown toggle
Optionally, set the Content Extraction Method to control which part of the page gets converted:
- Component Buffer extracts the component's raw HTML output before the template wraps it. This is the cleanest option and works well for standard Joomla articles and views.
- CSS Selector lets you target a specific element in the final rendered HTML. This is useful when your content lives inside a page builder or custom template wrapper. Set your selector in the field that appears below (e.g. .article-content, #main article).
- Full Page converts the entire rendered HTML once Joomla has finished rendering. Use it as a last resort when the other two methods miss content.

Accessing Markdown Pages

There are three ways to request a Markdown page:

.md URL suffix: Append .md to any page URL. For example, /my-article.md returns the Markdown version of /my-article. These URLs are independently cacheable and work in any HTTP client without setting custom headers. Joomla's SEF URLs must be enabled for this method to work.

?markdown=1 query parameter: A fallback for testing or for clients where you can't set custom headers. Works on any page.
Content Negotiation: An agent that sends Accept: text/markdown gets Markdown automatically. This allows clients to request different representations of the same resource using the HTTP-standard Accept header, rather than requiring different URLs. The same URL serves HTML to browsers and Markdown to agents, depending on the client's request. No URL changes needed on your end.

Try it yourself:
```
# Responds with HTML
curl https://tassos.gr/docs/google-structured-data/functionality/html-to-markdown

# Responds with Markdown
curl -H "Accept: text/markdown" https://tassos.gr/docs/google-structured-data/functionality/html-to-markdown
```
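The same three request styles can also be built in code. The following is a minimal Python sketch using only the standard library. The helper names (`with_markdown_param`, `markdown_requests`) and the example.com URL are illustrative assumptions, not part of the extension. The requests are constructed but not sent.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import Request


def with_markdown_param(url):
    """Return the URL with markdown=1 appended to its query string."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("markdown", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def markdown_requests(page_url):
    """Build one request per access method described above (hypothetical helper)."""
    parts = urlsplit(page_url)
    # .md suffix: requires SEF URLs to be enabled on the site.
    suffix_url = urlunsplit(parts._replace(path=parts.path.rstrip("/") + ".md"))
    return {
        "suffix": Request(suffix_url),
        "query": Request(with_markdown_param(page_url)),
        # Content negotiation: same URL, different Accept header.
        "negotiation": Request(page_url, headers={"Accept": "text/markdown"}),
    }


requests_by_method = markdown_requests("https://example.com/my-article")
for method, request in requests_by_method.items():
    print(method, request.full_url, request.get_header("Accept"))
```

Passing any of these objects to `urllib.request.urlopen` would send the request. The site root (`/`) is not handled specially in this sketch.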

What's Included in a Markdown Page

Each Markdown page response is structured and predictable, making it easy for both humans and AI agents to parse and use.

YAML Frontmatter

YAML frontmatter is an authoring convention popularized by Jekyll that provides a way to add structured metadata to Markdown pages. It is a block of key-value content enclosed between two delimiters at the top of the file, and is well understood by most static site generators, AI agents, and developer tooling.

In each Markdown response, the frontmatter includes the following properties:

title: The page title as set in the document.
description: The page meta description.
url: The canonical URL of the HTML page. Its primary job is to tell an agent "this content belongs to this page" so it can cite it properly in a response to a human.
date: The publication date of the content in ISO 8601 format. Falls back to the current date if no publication date is available.
language: The language tag of the current page (e.g. en-GB).
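To illustrate how a client might consume these properties, here is a hedged Python sketch that splits the frontmatter from the page body. It assumes the conventional `---` delimiters and flat `key: value` lines. The sample values are made up, and the exact layout your site emits may differ. For anything beyond flat keys, a real YAML parser is the safer choice.

```python
def split_frontmatter(markdown_text):
    """Return (metadata dict, body) for text that starts with a --- block.

    Only flat "key: value" lines are handled; this is not a YAML parser.
    """
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown_text

    meta = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing delimiter is the page body.
            body = "\n".join(lines[index + 1:]).lstrip("\n")
            return meta, body
        key, separator, value = line.partition(":")
        if separator:
            meta[key.strip()] = value.strip().strip("\"'")

    # No closing delimiter found: treat the whole text as body.
    return {}, markdown_text


# Hypothetical response, shaped after the properties listed above.
SAMPLE = """---
title: My Article
description: A short summary.
url: https://example.com/my-article
date: 2025-01-15T09:30:00+00:00
language: en-GB
---

# My Article

Body text.
"""

meta, body = split_frontmatter(SAMPLE)
print(meta["url"], meta["language"])
```

Because the key is split at the first colon only, values that contain colons themselves, such as the URL and the ISO 8601 date, stay intact.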

Page Content

This is the main body of the page, cleaned up and transformed for readability. The exact content included depends on the Content Extraction Method selected on the AI Agents tab. The following transformations are applied:

Elements stripped. Navigation, scripts, styles, forms, and interactive elements are removed, leaving only the readable text.
Relative URLs converted to absolute. All relative links and image sources are resolved to their full absolute URLs.

JSON-LD Schema

At the bottom of every Markdown response, the full JSON-LD structured data is included as a fenced code block. This is the same schema markup that Google Structured Data generates for search engines, covering types such as Article, Product, and Event. It gives AI agents rich semantic context about the page in a format they can parse directly, without having to infer it from the content.
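A client can recover that schema by reading the last fenced block that parses as JSON. The sketch below is one way to do it in Python, under a few assumptions: the fence uses three backticks, the info string (if any) is ignored, and the sample document is invented for illustration.

```python
import json
import re

# Matches a fenced block: opening fence line, body, closing fence line.
FENCE = re.compile(r"^`{3}[^\n]*\n(.*?)^`{3}\s*$", re.DOTALL | re.MULTILINE)


def extract_json_ld(markdown_text):
    """Return the parsed JSON from the last fenced block that is valid JSON."""
    for block in reversed(FENCE.findall(markdown_text)):
        try:
            return json.loads(block)
        except json.JSONDecodeError:
            continue  # Not JSON (e.g. a code sample in the article body).
    return None


TICKS = "`" * 3
# Hypothetical Markdown response ending with a JSON-LD block.
SAMPLE = (
    "# My Article\n\nBody text.\n\n"
    + TICKS + "json\n"
    + '{"@context": "https://schema.org", "@type": "Article", "headline": "My Article"}\n'
    + TICKS + "\n"
)

schema = extract_json_ld(SAMPLE)
print(schema["@type"])
```

Searching from the end keeps code samples inside the article body from being mistaken for the schema.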

Response Headers

Every Markdown response also includes the following HTTP headers:

X-Markdown-Tokens gives an estimated count of how many tokens the Markdown response contains. AI models have a limit on how much text they can process at once, often called a context window. This header lets an agent check the size of a page before reading it, so it can decide whether it fits within that limit.
Vary: Accept tells caches that the response depends on the request's Accept header, so browsers and agents each get the right version.
X-Robots-Tag: noindex, nofollow keeps Markdown URLs out of search results. Without it, search engines could index both the HTML and Markdown versions of the same page, creating duplicate content. `noindex` tells crawlers not to index the Markdown URL, so only the canonical HTML page is indexed. `nofollow` tells them not to follow links in the Markdown response; those links already exist on the canonical HTML page, so there is nothing new for a crawler to discover.
Link: <url>; rel="canonical" points HTTP clients and caches back to the canonical HTML URL, mirroring the url property in the YAML frontmatter, but at the transport layer.
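As an illustration of how an agent might act on these headers, the Python sketch below reads them from a plain mapping and checks the token estimate against a budget. The function name, the budget, and the sample header values are assumptions made for the example.

```python
import re


def summarize_headers(headers, context_budget):
    """Summarize the Markdown response headers described above.

    `headers` is any mapping of header name to value; names are matched
    case-insensitively. `context_budget` is the caller's own token limit.
    """
    normalized = {name.lower(): value for name, value in headers.items()}

    tokens_raw = normalized.get("x-markdown-tokens", "").strip()
    tokens = int(tokens_raw) if tokens_raw.isdigit() else None

    # Pull the target out of: <https://...>; rel="canonical"
    link_match = re.search(
        r'<([^>]+)>\s*;\s*rel="?canonical"?', normalized.get("link", "")
    )

    robots = [
        part.strip().lower()
        for part in normalized.get("x-robots-tag", "").split(",")
        if part.strip()
    ]

    return {
        "tokens": tokens,
        "fits": tokens is not None and tokens <= context_budget,
        "canonical": link_match.group(1) if link_match else None,
        "noindex": "noindex" in robots,
    }


# Hypothetical header values for illustration.
SAMPLE_HEADERS = {
    "Content-Type": "text/markdown; charset=utf-8",
    "X-Markdown-Tokens": "1840",
    "Vary": "Accept",
    "X-Robots-Tag": "noindex, nofollow",
    "Link": '<https://example.com/my-article>; rel="canonical"',
}

print(summarize_headers(SAMPLE_HEADERS, context_budget=8000))
```

Since the token count is an estimate, a cautious client would leave some headroom rather than fill its budget exactly.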

Helping Agents Discover the Markdown Version

Google Structured Data injects a <link rel="alternate" type="text/markdown"> tag into the <head> of every HTML page. Agents that visit the HTML version can find the Markdown URL automatically, without any prior knowledge of the URL structure.

<link rel="alternate" type="text/markdown" href="https://site.com/docs/getting-started.md" />

The alternate URL format depends on your site's SEF URL setting. With SEF URLs enabled, it uses the .md suffix (e.g. /my-article.md). With SEF URLs disabled, it falls back to the ?markdown=1 query parameter.
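From the agent's side, discovery amounts to scanning the HTML for that tag. A minimal Python sketch with the standard library's `html.parser` follows. The class and function names are illustrative, and the sample HTML reuses the tag shown above.

```python
from html.parser import HTMLParser


class MarkdownAlternateFinder(HTMLParser):
    """Records the href of the first <link rel="alternate" type="text/markdown">."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        media_type = (attributes.get("type") or "").lower()
        if "alternate" in rel_values and media_type == "text/markdown":
            self.href = attributes.get("href")


def find_markdown_alternate(html_text):
    """Return the Markdown alternate URL, or None if the page has none."""
    finder = MarkdownAlternateFinder()
    finder.feed(html_text)
    return finder.href


SAMPLE_HTML = """<html><head>
<title>Getting Started</title>
<link rel="alternate" type="text/markdown" href="https://site.com/docs/getting-started.md" />
</head><body><p>Hello</p></body></html>"""

print(find_markdown_alternate(SAMPLE_HTML))
```

The href is returned exactly as written in the page, so it works the same whether the site uses the .md suffix or the ?markdown=1 fallback.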

Caching Markdown Pages

Markdown output is cached using Joomla's cache layer. Each entry is keyed by URL and language, so multilingual sites get separate cached entries per language.

Automatic invalidation. The cache is cleared automatically when an article is updated in the backend.
Lifetime. Cache duration follows the value set in Joomla's global configuration under System → Global Configuration → Cache Time.
Logged-in users are never cached. Markdown pages for logged-in users are always generated fresh. This prevents content that's only visible to authenticated users from leaking into a cached response that a guest could later retrieve.
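The keying and guest-only rules can be pictured with a small Python sketch. This is not the extension's actual implementation. The function names and the hashing scheme are assumptions chosen only to show why each language gets its own entry and why logged-in responses are kept out of the cache.

```python
import hashlib


def markdown_cache_key(url, language):
    """Derive a cache key from URL and language (illustrative scheme only)."""
    raw = f"{language}|{url}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def is_cacheable(user_is_guest):
    """Only guest responses may be stored, so private content cannot leak."""
    return user_is_guest


english = markdown_cache_key("https://example.com/my-article", "en-GB")
greek = markdown_cache_key("https://example.com/my-article", "el-GR")
print(english != greek)  # Same URL, different language: separate entries.
```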

Clearing the Cache

To manually clear the Markdown cache for all pages on your site:

Log in to your Joomla administrator
Go to System → Maintenance → Clear Cache
Find gsd_markdown in the list of cache groups
Select it and click Delete

This wipes the cached Markdown version of every page on your site. The next request to any Markdown URL will regenerate and re-cache it.

Forcing a Cache Refresh

To refresh the cache for a single page, add ?force_update=1 to a Markdown request for that page. The cached version is discarded, and a fresh response is generated and stored in its place. This only works when the request is already a Markdown request:

https://example.com/my-article.md?force_update=1
https://example.com/my-article?markdown=1&force_update=1

It also works when sending an Accept: text/markdown header with ?force_update=1. Visiting a regular HTML URL with ?force_update=1 alone has no effect.
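The rule that force_update only applies to Markdown requests can be captured in a small helper. The Python sketch below is an illustration with a hypothetical function name. It appends the parameter only when the URL is already a Markdown URL, or when the caller states that it will send the Accept: text/markdown header.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def force_refresh_url(url, sends_accept_header=False):
    """Append force_update=1 to a Markdown request URL.

    Raises ValueError for a plain HTML URL, where the parameter has no effect.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)

    is_markdown_url = parts.path.endswith(".md") or ("markdown", "1") in query
    if not (is_markdown_url or sends_accept_header):
        raise ValueError("force_update=1 only works on a Markdown request")

    if ("force_update", "1") not in query:
        query.append(("force_update", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(force_refresh_url("https://example.com/my-article.md"))
print(force_refresh_url("https://example.com/my-article?markdown=1"))
```

With `sends_accept_header=True`, the helper also accepts a regular page URL, matching the content negotiation case.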

Cloudflare Markdown for Agents: Compatibility and Limitations

Cloudflare recently launched a feature called Markdown for Agents that can convert HTML pages to Markdown at the CDN level. When enabled on your Cloudflare zone, Cloudflare intercepts requests with an Accept: text/markdown header, fetches the HTML from your origin, converts it on the fly, and returns Markdown to the agent.

If you use both, disable one of them. When both are active, Cloudflare forwards the Accept: text/markdown header to your origin and Google Structured Data intercepts it, serving Markdown back to Cloudflare. What Cloudflare does at that point is not confirmed — it may detect the Content-Type: text/markdown response and pass it through unchanged, or it may attempt to re-process it. The safest approach is to run only one at a time.

The recommendation is to keep Google Structured Data's conversion enabled and disable Cloudflare's Markdown for Agents. Here's why:

Cloudflare has no knowledge of your Joomla structure. It converts the full HTML, including template chrome, module positions, and any other blocks on the page. Google Structured Data knows exactly which part of the page is the component output and strips everything else out.
Cloudflare can't include your structured data. The JSON-LD schema Google Structured Data generates for search engines is appended to the Markdown response. No CDN layer can replicate that, because the data comes from your Joomla configuration and content.
Cloudflare's frontmatter is limited. It includes a title and description. Google Structured Data also includes the canonical URL, publication date, and language tag.
No dedicated Markdown URLs. Cloudflare's conversion only responds to the Accept: text/markdown header. There are no .md URLs or ?markdown=1 parameters — no way to link directly to a Markdown version of a page.
No discoverability. Cloudflare doesn't inject a <link rel="alternate" type="text/markdown"> tag into your HTML pages. Agents visiting the HTML version have no way to discover that a Markdown version exists.

If you prefer to use Cloudflare's conversion instead, disable the Convert Pages to Markdown toggle in Google Structured Data's configuration. The two should not run at the same time.

Troubleshooting

The Markdown version shows outdated content

The page is being served from cache. Add ?force_update=1 to a Markdown request to clear the cached version and regenerate it.

Markdown URLs are being redirected

This usually happens when Joomla's System - SEF plugin runs before System - Google Structured Data and rewrites the URL on its own. Go to System → Manage → Plugins, filter by type System, and give System - Google Structured Data a lower ordering number than System - SEF so it runs first. Then request the Markdown URL again with ?force_update=1 to refresh the page's cache, or clear your cache.