Documentation

Learn how to use robots.txt, sitemaps, and LLMS.txt together with audits and APIs so search engines and AI crawlers can understand your site reliably.

Introduction

This platform helps you produce and validate the files that define how bots discover and use your content: robots.txt for crawl policies, XML sitemaps for URL discovery, and LLMS-oriented documentation where you describe how AI systems should treat your pages.

Core capabilities

Robots.txt Generator — model user-agent blocks, Allow/Disallow paths, crawl hints, and sitemap declarations.
AI Crawlability Audit — crawl sitemap seeds and internal links, sample broken links, and score robots/sitemap posture.
Sitemap.xml Generator — crawl from a starting URL and emit standards-compliant XML, with a sitemap index when URL counts grow.
LLMS.txt Generator — summarize compliance-oriented signals into an AI-readable document tailored to training and attribution preferences.

Prefer the UI first, then automate with the HTTP API documented under Endpoints.

Quick Start Guide

Robots.txt
Open Robots.txt Generator, enter your site URL, tune user agents and paths, generate the file, and place /robots.txt at the host root.
Sitemap
Use Sitemap.xml Generator with your canonical HTTPS origin. Download or copy XML, publish at /sitemap.xml (or reference the URL in robots.txt).
LLMS.txt
Generate from LLMS.txt Generator; review wording for legal accuracy before publishing /llms.txt.
Audit
Optionally run the AI Crawlability Audit to validate live robots.txt, declared sitemaps, and internal crawl health.

API Overview

The backend exposes REST endpoints under /api. The Next.js app typically sets NEXT_PUBLIC_API_BASE_URL (or NEXT_PUBLIC_API_URL) to your server origin plus /api, for example http://localhost:3000/api.

Responses are JSON unless you download artifacts (for example analyzer reports). A separate liveness probe lives at GET /health on the same host as the Express app (without the /api prefix).
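The relationship between the API base (origin plus /api) and the liveness probe (same host, no /api prefix) can be sketched as below. The helper names normalizeApiBase and healthUrl are hypothetical and not part of the shipped client; appending /api when it is missing and falling back to a same-origin /api path are assumptions made for illustration.

```typescript
// Hypothetical helpers; not part of the shipped client.

// Trim trailing slashes and make sure the base ends with "/api".
function normalizeApiBase(raw: string | undefined): string {
  const trimmed = (raw ?? "").trim().replace(/\/+$/, "");
  if (trimmed === "") return "/api"; // assumption: same-origin fallback
  return trimmed.endsWith("/api") ? trimmed : `${trimmed}/api`;
}

// The liveness probe lives on the same host, without the /api prefix.
function healthUrl(apiBase: string): string {
  return apiBase.replace(/\/api$/, "") + "/health";
}

const base = normalizeApiBase("http://localhost:3000/api/");
console.log(base);            // http://localhost:3000/api
console.log(healthUrl(base)); // http://localhost:3000/health
```

Normalizing once at startup keeps every request path consistent, whichever environment variable supplied the value.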

POST /api/generate/robots
Content-Type: application/json

{
  "url": "https://example.com",
  "userAgents": [
    { "name": "*", "disallow": ["/admin/", "/api/"], "allow": [] }
  ],
  "sitemapUrl": "https://example.com/sitemap.xml",
  "crawlDelay": 1,
  "additionalRules": ["# staging rules"]
}

Authentication

Generator and audit endpoints are intended for authenticated frontends or trusted backends. Today the API accepts optional Authorization: Bearer <token> headers for forward compatibility—the bundled web client attaches a stored token when present.

For production integrations, terminate TLS at your edge, restrict origins via ALLOWED_ORIGINS, and place the API behind your own API gateway or service mesh if you need per-key quotas distinct from IP-based rate limiting.
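Because the bearer token is optional today, a client can attach it only when one is stored. The sketch below assumes a hypothetical buildHeaders helper; how the token is stored and retrieved is outside its scope.

```typescript
// Hypothetical helper: build JSON request headers and attach a bearer token
// only when a non-empty token is available.
function buildHeaders(token?: string | null): Record<string, string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (token && token.trim() !== "") {
    headers["Authorization"] = `Bearer ${token.trim()}`;
  }
  return headers;
}

console.log(buildHeaders("abc123"));
console.log(buildHeaders(null));
```

Omitting the header entirely when no token exists avoids sending an empty credential that a gateway might reject.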

AI Crawlability Audit — Overview

The AI Crawlability Audit is a bounded HTTP crawl of your site that focuses on how easily machines can discover and fetch your pages. It discovers URLs from robots.txt sitemap declarations and common locations (for example /sitemap.xml), then follows internal links up to a depth and page cap. Along the way it records status codes, canonical tags, and meta name="robots" values, samples internal links for broken responses, and flags query-string URLs. The results are combined into a scored report with a plain-text human_summary.

It complements the Robots, Sitemap, and LLMS.txt tools: those help you author policies and files; the audit checks what is live on the wire.

Running an audit (web UI)

Open /crawl-audit, enter a full site URL (for example https://example.com), adjust options if needed, and choose Run Audit. The page POSTs to your API base (from NEXT_PUBLIC_API_BASE_URL or NEXT_PUBLIC_API_URL) at .../api/analyze/crawl, or to /api/analyze/crawl when no base is set (same-origin proxy).

Successful runs render Summary, Robots & Sitemap, a per-URL table, broken-link samples, and query-parameter URLs. Use that layout as a checklist when building your own UI on top of the JSON report.

Options & presets

Request body fields

url — required starting origin; normalized server-side.
maxPages — max URLs processed (default 100 in server and UI).
depthLimit — max link hops from each seed (default 2).
concurrency — parallel in-flight requests, clamped roughly 1–20 on the server (default 12).
followExternal — follow off-domain links matching host rules when true (usually leave off unless you intentionally audit cross-domain properties).
rateLimitMs — optional minimum delay between outbound requests.
renderAllPages — when true, attempts Puppeteer-based rendering where available to capture JS-only links (much slower).
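The defaults and clamping described in the field list can be mirrored on the client before sending a request. This is a minimal sketch: the interface and the withAuditDefaults helper are hypothetical, the 1 to 20 bound comes from the approximate range quoted above, and the server remains the authority on normalization.

```typescript
// Field names follow the request body fields listed above.
interface CrawlAuditOptions {
  url: string;
  maxPages?: number;
  depthLimit?: number;
  concurrency?: number;
  followExternal?: boolean;
  rateLimitMs?: number;
  renderAllPages?: boolean;
}

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Hypothetical helper: fill documented defaults and clamp concurrency.
function withAuditDefaults(options: CrawlAuditOptions): Required<CrawlAuditOptions> {
  return {
    url: options.url.trim(),
    maxPages: Math.max(1, Math.floor(options.maxPages ?? 100)),
    depthLimit: Math.max(0, Math.floor(options.depthLimit ?? 2)),
    concurrency: clamp(Math.floor(options.concurrency ?? 12), 1, 20),
    followExternal: options.followExternal ?? false,
    rateLimitMs: Math.max(0, options.rateLimitMs ?? 0),
    renderAllPages: options.renderAllPages ?? false,
  };
}

const body = withAuditDefaults({ url: "https://example.com", concurrency: 64 });
console.log(JSON.stringify(body));
```

Applying the same bounds client-side means the values shown in a UI match what the server is likely to run with.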

Built-in presets (UI)

Fast — fewer pages and depth, higher concurrency, rendering off (runs of roughly one minute on typical sites).
Balanced — ~100 pages, depth 2, default concurrency, rendering off.
Thorough — more pages and depth, lower concurrency, rendering on (longer runs; requires a working headless browser install for render mode).

Report structure

The API wraps the payload as { success, report, version, timestamp }. The report object includes:

base_url, audit_date_utc, total_urls_discovered
sitemap_urls — URLs collected from robots and common sitemap locations.
robots — simplified allow, disallow, sitemaps extracted from robots.txt (missing file is non-fatal).
crawl_results_summary — final_score (0–100), crawlable_urls, broken_count, advisory sitemap_score and robots_score (each typically 5 or 10 in the heuristic).
broken_links_sample — up to hundreds of problematic URLs/strings.
per_url — map of URL → status, ok, canonical, metaRobots, links, out_broken_sample (broken internal sample per page), hasQuery, depth.
human_summary — multi-line text summary aligned with scores.

Persist or diff reports by audit_date_utc and canonical base_url when trending improvements across deploys.
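Trending reports across deploys can be as simple as comparing two summaries for the same base_url. The sketch below uses the field names from the report structure; the value types, the diffReports helper, and the sample numbers are assumptions for illustration only.

```typescript
// Field names follow the report structure above; value types are assumed.
interface CrawlSummary {
  final_score: number;
  crawlable_urls: number;
  broken_count: number;
  sitemap_score: number;
  robots_score: number;
}

interface CrawlReport {
  base_url: string;
  audit_date_utc: string;
  total_urls_discovered: number;
  crawl_results_summary: CrawlSummary;
}

interface ReportDelta {
  base_url: string;
  from: string;
  to: string;
  scoreChange: number;
  brokenChange: number;
}

// Hypothetical helper: compare two audits of the same site.
function diffReports(previous: CrawlReport, current: CrawlReport): ReportDelta {
  if (previous.base_url !== current.base_url) {
    throw new Error("Reports cover different sites");
  }
  return {
    base_url: current.base_url,
    from: previous.audit_date_utc,
    to: current.audit_date_utc,
    scoreChange: current.crawl_results_summary.final_score - previous.crawl_results_summary.final_score,
    brokenChange: current.crawl_results_summary.broken_count - previous.crawl_results_summary.broken_count,
  };
}

// Illustrative sample data, not real audit output.
const before: CrawlReport = {
  base_url: "https://example.com",
  audit_date_utc: "2026-05-01T00:00:00Z",
  total_urls_discovered: 120,
  crawl_results_summary: { final_score: 72, crawlable_urls: 95, broken_count: 9, sitemap_score: 10, robots_score: 5 },
};
const after: CrawlReport = {
  base_url: "https://example.com",
  audit_date_utc: "2026-05-09T00:00:00Z",
  total_urls_discovered: 124,
  crawl_results_summary: { final_score: 81, crawlable_urls: 110, broken_count: 4, sitemap_score: 10, robots_score: 10 },
};
const delta = diffReports(before, after);
console.log(delta.scoreChange, delta.brokenChange); // 9 -5
```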

How scoring works

The composite final_score (0–100) blends heuristic sitemap discovery, rudimentary robots presence signals, successful fetch ratio across crawled URLs, and a penalty scaled from sampled broken/internal errors. Exact weighting lives in crawlAuditService on the server—treat it as a directional health indicator, not a substitute for Google Search Console, log analysis, or vendor-specific crawl simulators.

Use sitemap_score / robots_score at face value only for quick regressions between deploys; they reward having discoverable sitemaps and a non-empty robots posture rather than deep semantic correctness of every directive.

API integration

POST /api/analyze/crawl
Content-Type: application/json

{
  "url": "https://example.com",
  "maxPages": 100,
  "depthLimit": 2,
  "concurrency": 12,
  "followExternal": false,
  "rateLimitMs": 0,
  "renderAllPages": false
}

200 OK → { success: true, report, version, timestamp }. 400 → missing or invalid URL. 500 → audit failure (error / details). See also API Reference — Endpoints.

Limits & troubleshooting

Crawls use HTTP fetches with a short default timeout; slow or flaky pages surface as ok: false—verify with curl or uptime tools before rewriting rules.
Broken-link checks sample a limited number of internal links per page to keep audits fast; unexplained 404s elsewhere may still exist.
To reduce noise from third-party outages, external links are excluded from broken-link sampling unless you configure otherwise.
renderAllPages requires puppeteer-core plus a runnable Chrome/Chromium; if unsupported, omit rendering and rely on static HTML extraction.
Outbound requests identify as LLMS-Audit/1.0; allowlisting that user agent on WAFs may be required.

Robots.txt — Basic Usage

robots.txt lives at https://your-domain/robots.txt. Bots fetch it before broad crawling; malformed or contradictory rules waste crawl budget and confuse search and AI spiders alike.

Workflow: configure each User-Agent block, add Allow/Disallow paths, optionally set crawl delay on the wildcard agent, then declare one or more Sitemap URLs. Validate before deploy using POST /api/validate/robots.

After publishing, sanity-check crawl policy with an AI Crawlability Audit run against production.

User-Agent Directives

Each logical block begins with User-agent: <token>. Typical tokens include * (the default group), vendor-specific spiders (for example Googlebot), and documented AI bots (GPTBot, Claude-Web, Bytespider; verify current official names in each vendor's crawler documentation).

Separate blocks allow different policies—for example disallowing admin routes for * while tightening training bots with another block scoped to documented paths only.

Rule precedence follows the robots exclusion protocol specific to each crawler; when in doubt, keep overlapping blocks simple and test with audits plus vendor webmaster tooling.
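As a reference point, RFC 9309 says the most specific (longest) matching rule wins and that Allow wins when an Allow and a Disallow rule are equivalent. The sketch below implements only that rule for plain prefix paths within one group; it ignores wildcards and percent-encoding normalization, and individual crawlers may behave differently. The type and function names are hypothetical.

```typescript
type RuleKind = "allow" | "disallow";

interface Rule {
  kind: RuleKind;
  path: string;
}

// Longest matching prefix wins; Allow wins a tie; no match means allowed.
function isAllowed(rules: Rule[], urlPath: string): boolean {
  let best: Rule | undefined;
  for (const rule of rules) {
    // An empty path matches nothing (for example a bare "Disallow:").
    if (rule.path === "" || !urlPath.startsWith(rule.path)) continue;
    if (
      best === undefined ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.kind === "allow")
    ) {
      best = rule;
    }
  }
  return best === undefined || best.kind === "allow";
}

const rules: Rule[] = [
  { kind: "disallow", path: "/admin/" },
  { kind: "allow", path: "/admin/public/" },
];
console.log(isAllowed(rules, "/admin/public/help")); // true
console.log(isAllowed(rules, "/admin/settings"));    // false
```

A check like this is useful for spotting overlapping rules before publishing, but it does not replace testing against each vendor's own tooling.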

Allow/Disallow Rules

Paths are prefix patterns matched against the URL path on a given scheme and host; trailing wildcards behave per the spec and each crawler's implementation.
Disallow: with an empty path means "nothing disallowed" under many parsers; the generator emits that pattern explicitly when lists are empty.
Allow: refines exclusions (useful under Disallow-heavy trees).
Separate sensitive areas (staging, dashboards, carts) explicitly instead of relying on security through obscurity.

The REST generator accepts arrays of disallow and allow paths per userAgents[].disallow and allow, plus arbitrary lines in additionalRules for directives not modeled in the schema.
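To make the mapping from request fields to file lines concrete, here is a minimal renderer that accepts the same shape as the request body. It is a hypothetical illustration, not the server's generator: the line ordering, the blank line between groups, and attaching the crawl delay to the wildcard group only are assumptions based on the workflow described above.

```typescript
interface UserAgentBlock {
  name: string;
  disallow: string[];
  allow: string[];
}

interface RobotsConfig {
  userAgents: UserAgentBlock[];
  sitemapUrl?: string;
  crawlDelay?: number;
  additionalRules?: string[];
}

// Hypothetical renderer; the real generator may order lines differently.
function renderRobots(config: RobotsConfig): string {
  const lines: string[] = [];
  for (const agent of config.userAgents) {
    lines.push(`User-agent: ${agent.name}`);
    for (const path of agent.allow) lines.push(`Allow: ${path}`);
    if (agent.disallow.length === 0) {
      lines.push("Disallow:"); // empty value: nothing is disallowed
    } else {
      for (const path of agent.disallow) lines.push(`Disallow: ${path}`);
    }
    if (agent.name === "*" && config.crawlDelay !== undefined) {
      lines.push(`Crawl-delay: ${config.crawlDelay}`);
    }
    lines.push(""); // blank line between groups
  }
  for (const rule of config.additionalRules ?? []) lines.push(rule);
  if (config.sitemapUrl) lines.push(`Sitemap: ${config.sitemapUrl}`);
  return lines.join("\n") + "\n";
}

const robotsTxt = renderRobots({
  userAgents: [
    { name: "*", disallow: ["/admin/", "/api/"], allow: [] },
    { name: "GPTBot", disallow: [], allow: [] },
  ],
  sitemapUrl: "https://example.com/sitemap.xml",
  crawlDelay: 1,
});
console.log(robotsTxt);
```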

AI Crawler Control

AI-facing crawlers reuse the robots protocol but may advertise distinct user-agents or honor additional signals (RSS, feeds, contractual terms beyond robots). Start from vendor guidance, then:

Add dedicated User-agent sections for bots you want to constrain differently from generic *.
Keep training opt-out paths aligned with your LLMS disclosures and footer legal copy.
Reconcile robots with content paywalls—HTTP 403/401 semantics differ from crawler-specific Disallow semantics.

Generated snippets include commented examples for GPTBot-style agents to speed up authoring—uncomment and tailor them before production.

Sitemap Integration

Declaring Sitemap: <absolute-url> in robots.txt aids discovery, even though it is technically optional once search consoles know your URLs. Prefer HTTPS sitemap URLs, point to a sitemap index file when you shard, and keep each file within search engine limits (typically 50,000 URLs per file plus an uncompressed size cap; see Large Sitemaps).

The generator returns content plus optional existing (live fetch) and warnings highlighting divergences from your current robots.txt preview.
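When comparing a generated file with the live one, it helps to pull the declared sitemap URLs out of each. The sketch below is a hypothetical parser, not the platform's implementation: it matches the directive case-insensitively, strips trailing comments, skips relative values, and removes duplicates.

```typescript
// Hypothetical helper: collect absolute Sitemap URLs from robots.txt text.
function extractSitemaps(robotsTxt: string): string[] {
  const found: string[] = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const match = /^sitemap\s*:\s*(\S+)/i.exec(line);
    if (match && /^https?:\/\//i.test(match[1]) && found.indexOf(match[1]) === -1) {
      found.push(match[1]);
    }
  }
  return found;
}

const live = [
  "User-agent: *",
  "Disallow: /admin/",
  "Sitemap: https://example.com/sitemap.xml",
  "sitemap: /relative.xml",
  "Sitemap: https://example.com/sitemap.xml # duplicate",
].join("\n");
console.log(extractSitemaps(live)); // [ 'https://example.com/sitemap.xml' ]
```

Stripping everything after a # also truncates URLs that carry a fragment, which is an accepted trade-off here because sitemap URLs rarely include one.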

Sitemap.xml — Creating Sitemaps

Send POST /api/generate/sitemap with a validated site URL as the seed. The crawler walks internal links, respecting robots.txt when respectRobots is true (the default), builds page records from HTML responses, and emits URL sets (optionally chunked with an index when volume demands).

maxPages caps discovered URLs (MAX_PAGES_PER_SITEMAP env upper bound applies server-wide).
maxDepth trims exploration depth.
filterOptions forwards fine-grained crawler filters (paths, link patterns).
verbose toggles crawler diagnostics.

The HTTP response carries jobId, immediate sitemap XML, stats (counts, stopped-early indicator, byte size estimate), and a statusUrl for polling with GET /api/sitemap/status/:jobId.

POST /api/generate/sitemap
Content-Type: application/json

{
  "url": "https://example.com/",
  "maxPages": 50000,
  "maxDepth": 8,
  "respectRobots": true,
  "filterOptions": {},
  "verbose": false
}

URL Prioritization

In advanced mode the builder may emit <priority> hints. Numeric priority is advisory only—search engines approximate importance from linkage, freshness, and query demand. Prefer consistent canonical tagging in HTML (rel=canonical) over aggressive priority tweaking.

Use priority to elevate templates that materially affect navigation (homepage, cornerstone guides), not transactional noise.

Change Frequency (<changefreq>)

changefreq is another advisory hint (never, yearly, weekly, daily…). It does not obligate bots to revisit on that cadence—actual schedules derive from crawl budget and observed change rates.

Set realistic coarse values aligned with editorial cadence. Pair with accurate lastmod timestamps when feasible; inflated frequencies erode trust signals across engines.
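For reference, a single sitemap entry carrying these hints might be rendered as below. This is a minimal sketch with hypothetical names: it uses the changefreq values defined by the sitemaps protocol, writes lastmod as a UTC date, clamps priority to the 0.0 to 1.0 range, and escapes XML special characters in the URL.

```typescript
type ChangeFreq = "always" | "hourly" | "daily" | "weekly" | "monthly" | "yearly" | "never";

interface SitemapEntry {
  loc: string;
  lastmod?: Date;
  changefreq?: ChangeFreq;
  priority?: number;
}

// Escape the five XML special characters.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Hypothetical renderer for one <url> element.
function renderUrlEntry(entry: SitemapEntry): string {
  const parts = [`  <loc>${escapeXml(entry.loc)}</loc>`];
  if (entry.lastmod) {
    parts.push(`  <lastmod>${entry.lastmod.toISOString().slice(0, 10)}</lastmod>`);
  }
  if (entry.changefreq) {
    parts.push(`  <changefreq>${entry.changefreq}</changefreq>`);
  }
  if (entry.priority !== undefined) {
    const clamped = Math.min(1, Math.max(0, entry.priority));
    parts.push(`  <priority>${clamped.toFixed(1)}</priority>`);
  }
  return `<url>\n${parts.join("\n")}\n</url>`;
}

const entryXml = renderUrlEntry({
  loc: "https://example.com/guides?a=1&b=2",
  lastmod: new Date("2026-05-09T12:00:00Z"),
  changefreq: "weekly",
  priority: 1.4,
});
console.log(entryXml);
```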

Large Sitemaps

Shard into multiple XML files, each capped at 50,000 URLs and 50 MB uncompressed, per the sitemaps protocol and major engine guidance.
Publish a sitemap-index that references shard URLs; list the index URL in robots.txt.
Warm CDN caches after publishing and verify HTTP 200 responses with gzip/brotli as appropriate.
Track stoppedEarly in API stats—if true, widen limits or deepen crawl thoughtfully.
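The sharding step above can be sketched as a chunking pass followed by an index that lists each shard. The helper names and the shard URL pattern are hypothetical; this version caps only the URL count, so a production job would also need to check the byte size of each file.

```typescript
const MAX_URLS_PER_FILE = 50000;

// Split a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Render a sitemap index that references each shard URL.
function renderSitemapIndex(shardUrls: string[]): string {
  const entries = shardUrls
    .map((loc) => `  <sitemap><loc>${loc.replace(/&/g, "&amp;")}</loc></sitemap>`)
    .join("\n");
  return (
    '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    entries +
    "\n</sitemapindex>\n"
  );
}

// 120,001 synthetic URLs split into three shards.
const urls = Array.from({ length: 120001 }, (_, i) => `https://example.com/p/${i}`);
const shards = chunk(urls, MAX_URLS_PER_FILE);
const indexXml = renderSitemapIndex(
  shards.map((_, i) => `https://example.com/sitemaps/pages-${i + 1}.xml`),
);
console.log(shards.length); // 3
```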

Video & Image Sitemaps

Standard URL sitemaps help discovery for HTML pages; media-rich catalogs often benefit from extension namespaces documenting videos and images (xmlns:video, xmlns:image) with durations, thumbnails, geo, and captions.

The built-in crawler targets HTML link graphs first. Treat dedicated media sitemap entries as authored XML you maintain alongside programmatic page sitemaps: export metadata from CMS or CDN APIs, attach stable media URLs only, validate against Google/Microsoft schema examples, then reference those files from your index alongside standard URL sets.

For large libraries, segregate shards by locale or CDN partition to localize invalidation workflows.

LLMS.txt — What is it?

LLMS-oriented text surfaces how you intend AI systems—including training crawlers—to handle your web properties: permitted uses, attribution, contact for licensing, feeds to prioritize, etc. Policies may evolve independently of robots.txt exclusions; robots governs crawling mechanics while LLMS prose documents business rules.

This product generates a pragmatic starting document from crawling and signal extraction; lawyers and policy owners must review wording before relying on it in contracts or compliance attestations.

Format Specification

There is no single ratified RFC, so review emerging community guidance with your governance team. Practically:

Publish at /llms.txt with UTF-8 encoding and deterministic caching headers.
Structure short sections with imperative statements (training allowed/denied, citation requirements).
Maintain versioning inside the doc (ISO date stamped header) whenever terms change materially.

POST /api/generate/llms
Content-Type: application/json

{
  "url": "https://example.com",
  "maxPages": 50,
  "allowAITraining": true,
  "requireAttribution": true
}

Response returns jobId; poll GET /api/llms/status/:jobId until completed to read content, aiReadinessScore, and analysis metadata (titles, robots/sitemap hints, structured data presence).
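The polling flow can be sketched independently of any HTTP library by injecting the status call and the delay. The JobStatus shape, the pollJob helper, the attempt limit, and the "failed" status name are assumptions; only "completed", content, and aiReadinessScore come from the description above.

```typescript
// Assumed response shape for the status route.
interface JobStatus {
  status: string;
  content?: string;
  aiReadinessScore?: number;
  error?: string;
}

// Poll until the job reports "completed", giving up after maxAttempts.
async function pollJob(
  getStatus: () => Promise<JobStatus>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 30,
  intervalMs = 2000,
): Promise<JobStatus> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await getStatus();
    if (job.status === "completed") return job;
    // "failed" is an assumed status name, not confirmed by the API docs.
    if (job.status === "failed") throw new Error(job.error ?? "job failed");
    await sleep(intervalMs);
  }
  throw new Error(`Job did not complete after ${maxAttempts} attempts`);
}
```

In a real client, getStatus would wrap GET /api/llms/status/:jobId and sleep would wrap a timer; injecting both keeps the loop easy to test with canned responses.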

AI Optimization

Align disclaimers across LLMS.txt, Terms, and crawl policies so automated summarizers ingest consistent intent.
Expose machine-readable manifests (datasets, embeddings policies) referenced from llms prose.
Measure readiness using the readiness score surfaced in responses as a heuristic, then validate with manual QA.

POST /api/enhance/llms exposes a forwards-compatible enrichment hook accepting content and enhancementType; integrations may augment text with LLM rewriting when enabled server-side.

Content Organization

Prefer top-down narration: identities (who operates the site), scope (sites covered), crawler posture, licensing, attribution clauses, escalation contacts, changelog. Optionally cross-link FAQs and DMCA processes to avoid duplicating long legal prose inside llms-only files.

Favor short bullet lists over walls of prose; they parse well in screen readers and ingestion stacks.

Examples

A skeleton you might adapt after compliance review. Every value shown is a placeholder; replace it before publishing.

# LLMS Disclosure — ExampleCo — 2026-05-09
Organization: ExampleCo
Site: https://example.com

Training usage: Conditional — generative summaries allowed with attribution.
Citation: Visible attribution linking to canonical URLs required.

Disallowed uses: Competitive model training on paywalled content.

Contact: trust@example.com
Changelog:
- 2026-05-09 Initial publication

SEO Guidelines

Canonicalize duplicated routes; disallow faceted-parameter chaos or reflect intent via parameter handling rules.
Keep redirects shallow (avoid chains); return consistent status codes.
Use structured data where helpful; mismatches harm trust more than omission.
Monitor Search Console equivalents for exclusions linked to unintended robots collisions.

Performance Tips

Prefer edge caching on static bots files (robots, sitemap shards) while ensuring invalidations on deploy.
Compress sitemaps transport-wise (gzip/br) respecting crawler expectations.
During audits, tighten concurrency responsibly on shared tenancy to prevent self-DDoS signatures.

Security Considerations

Secrets never belong in publicly served SEO files.
Treat robots exclusions as crawl hints, not access control; protect sensitive URLs with authentication.
Validate inbound analyzer payloads server-side before passing them downstream.
Keep Helmet-derived headers (CSP, HSTS where applicable) orthogonal to crawler hints.

AI Crawler Management

Maintain a living registry that maps each user agent to the policy owner accountable for updates.
Document opt-in/opt-out flows mirrored in robots, llms disclosures, and contract riders.
Schedule quarterly reconciliations tying audit deltas to changelog entries.
Pair technical controls with human review when models reinterpret ambiguous policy language.

API Reference — Endpoints

Robots

POST /api/generate/robots
POST /api/validate/robots

Sitemaps

POST /api/generate/sitemap
GET /api/sitemap/status/:jobId
/api/saas/sitemap/* (queue-oriented flows)
/api/seo-engine/* (enhanced engine routes)
/api/sitemap/admin/* (administrative operations)

LLMS

POST /api/generate/llms
GET /api/llms/status/:jobId
POST /api/enhance/llms

Analyze & audits

POST /api/analyze/sitemap
GET /api/analyze/sitemap/status/:jobId
GET /api/analyze/sitemap/report/:jobId
GET /api/analyze/sitemap/xml/:jobId
GET /api/analyze/sitemap/json/:jobId
POST /api/analyze/classify
POST /api/analyze/classify-bulk
GET /api/analyze/stats/:jobId
GET /api/analyze/jobs
DELETE /api/analyze/jobs/:jobId
POST /api/analyze/crawl (crawlability audit)

Health

GET /health

Request / Response Conventions

Unless noted, POST bodies are JSON (application/json).
Successful generations return HTTP 200 with success: true wrappers where applicable.
Async endpoints answer immediately with jobId; poll companion status routes.
Use absolute URLs consistently in payloads to avoid ambiguity around schemes and redirects.

Rate Limiting

Global Express rate limiting applies under /api/* with a configurable window (RATE_LIMIT_MAX caps total hits per rolling interval). Requests hitting certain status or enhancement endpoints may be exempt—see server configuration for authoritative skip rules.

Design clients with exponential backoff, especially for analyzer jobs queued server-side.
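One common way to implement exponential backoff is the "full jitter" variant, where each delay is a random value between zero and a capped exponential ceiling. The sketch below assumes a hypothetical backoffDelayMs helper with illustrative base and cap values; the random source is injectable so the schedule can be tested deterministically.

```typescript
// Full-jitter backoff: random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 500,
  capMs = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// With a fixed random source the schedule is deterministic.
const half = () => 0.5;
console.log(backoffDelayMs(0, 500, 30000, half));  // 250
console.log(backoffDelayMs(3, 500, 30000, half));  // 2000
console.log(backoffDelayMs(10, 500, 30000, half)); // 15000
```

Jitter spreads retries from many clients over time, which matters when a shared rate limit window resets for everyone at once.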

Error Handling

400 — validation failures (missing URL, malformed robots body).
404 — unknown job identifiers in ephemeral stores.
500 — crawler/analysis faults; payloads include diagnostic error / details strings suitable for structured logging.

Axios-based clients bubble errors through interceptors configured in app/lib/api.ts; map status codes centrally to telemetry and user-visible retry affordances.
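A central status mapping might look like the sketch below. The planForStatus helper and its hints are hypothetical; treating 500 as retryable is a design choice rather than documented behavior, and the 429 branch assumes the rate limiter answers with that status, which this documentation does not confirm.

```typescript
interface ErrorPlan {
  retry: boolean;
  hint: string;
}

// Hypothetical mapping from HTTP status to a client-side action.
function planForStatus(status: number): ErrorPlan {
  if (status === 400) {
    return { retry: false, hint: "Fix the request payload (missing URL or malformed body)." };
  }
  if (status === 404) {
    return { retry: false, hint: "Job id is unknown or expired; submit a new job." };
  }
  if (status === 429) {
    // Assumption: the rate limiter responds with 429.
    return { retry: true, hint: "Rate limited; back off before retrying." };
  }
  if (status >= 500) {
    return { retry: true, hint: "Server-side fault; log error/details and retry with backoff." };
  }
  return { retry: false, hint: "No special handling." };
}

console.log(planForStatus(400).retry); // false
console.log(planForStatus(500).retry); // true
```

Keeping the mapping in one function gives telemetry and user-facing retry prompts a single source of truth.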

SDKs

The web repo ships Axios helpers exporting endpoints.robots, endpoints.sitemap, and endpoints.llms; add matching helpers for the analyze routes as needed, keeping base URL normalization consistent.

For other ecosystems, scaffold thin SDKs atop OpenAPI-derived clients once you stabilize schemas—prioritize retries, timeouts, typed error unions, and stream-friendly handling for analyzer downloads.

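// Axios helpers from the web repo (see app/lib/api.ts).
import { endpoints } from '@/lib/api'

// Generate a robots.txt that blocks /private/ for all crawlers.
// Top-level await assumes an ES module or an enclosing async function.
const { data } = await endpoints.robots.generate({
  url: 'https://example.com',
  userAgents: [{ name: '*', disallow: ['/private/'], allow: [] }],
})

// The generated file body is returned in data.content.
console.log(data.content)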

Need Help?

Reach out through support channels or browse common questions—we iterate documentation alongside API changes.

