Learn how to use robots.txt, sitemaps, and LLMS.txt together with audits and APIs so search engines and AI crawlers can understand your site reliably.
Ship your first files in minutes
Integrate generators and audits programmatically
This platform helps you produce and validate the files that define how bots discover and use your content: robots.txt for crawl policies, XML sitemaps for URL discovery, and LLMS-oriented documentation where you describe how AI systems should treat your pages.
Core capabilities
Prefer the UI first, then automate with the HTTP API documented under Endpoints.
Open Robots.txt Generator, enter your site URL, tune user agents and paths, generate the file, and place /robots.txt at the host root.
Use Sitemap.xml Generator with your canonical HTTPS origin. Download or copy XML, publish at /sitemap.xml (or reference the URL in robots.txt).
Generate from LLMS.txt Generator; review wording for legal accuracy before publishing /llms.txt.
Optionally run AI Crawlability Audit (docs) to validate live robots.txt, declared sitemaps, and internal crawl health.
The backend exposes REST endpoints under /api. The Next.js app typically sets NEXT_PUBLIC_API_BASE_URL (or NEXT_PUBLIC_API_URL) to your server origin plus /api, for example http://localhost:3000/api.
Responses are JSON unless you download artifacts (for example analyzer reports). A separate liveness probe lives at GET /health on the same host as the Express app (without the /api prefix).
POST /api/generate/robots
Content-Type: application/json
{
"url": "https://example.com",
"userAgents": [
{ "name": "*", "disallow": ["/admin/", "/api/"], "allow": [] }
],
"sitemapUrl": "https://example.com/sitemap.xml",
"crawlDelay": 1,
"additionalRules": ["# staging rules"]
}
Generator and audit endpoints are intended for authenticated frontends or trusted backends. Today the API accepts optional Authorization: Bearer <token> headers for forward compatibility; the bundled web client attaches a stored token when present.
For production integrations, terminate TLS at your edge, restrict origins via ALLOWED_ORIGINS, and place the API behind your own API gateway or service mesh if you need per-key quotas distinct from IP-based rate limiting.
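As a minimal sketch of the optional bearer header described above, the helper below prepares a JSON POST for the robots generator. The names buildJsonHeaders and prepareRobotsRequest are hypothetical and not part of the shipped client, and the sketch assumes the API base already ends in /api.

```typescript
// Hypothetical helpers (not part of the shipped client).
interface RobotsUserAgent {
  name: string;
  disallow: string[];
  allow: string[];
}

interface RobotsRequest {
  url: string;
  userAgents: RobotsUserAgent[];
  sitemapUrl?: string;
  crawlDelay?: number;
  additionalRules?: string[];
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Attach Authorization only when a non-empty token is available.
function buildJsonHeaders(token?: string): Record<string, string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (token && token.trim() !== "") {
    headers["Authorization"] = `Bearer ${token.trim()}`;
  }
  return headers;
}

// apiBase is assumed to be the server origin plus /api.
function prepareRobotsRequest(
  apiBase: string,
  payload: RobotsRequest,
  token?: string
): PreparedRequest {
  const base = apiBase.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${base}/generate/robots`,
    method: "POST",
    headers: buildJsonHeaders(token),
    body: JSON.stringify(payload),
  };
}

const prepared = prepareRobotsRequest(
  "http://localhost:3000/api",
  {
    url: "https://example.com",
    userAgents: [{ name: "*", disallow: ["/admin/", "/api/"], allow: [] }],
    sitemapUrl: "https://example.com/sitemap.xml",
  },
  "example-token"
);
console.log(prepared.url, prepared.headers);
```

The prepared object can be handed to any HTTP client, for example fetch(prepared.url, prepared); keeping request construction separate from sending makes the header logic easy to unit test.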
The AI Crawlability Audit is a bounded HTTP crawl of your site that focuses on how easily machines can discover and fetch your pages. It discovers URLs from robots.txt sitemap declarations and common locations (for example /sitemap.xml), follows internal links up to a depth and page cap, records status codes, canonical tags, and meta name="robots", samples internal links for broken responses, flags query-string URLs, and combines that into a scored report with a plain-text human_summary.
It complements the Robots, Sitemap, and LLMS.txt tools: those help you author policies and files; the audit checks what is live on the wire.
Open /crawl-audit, enter a full site URL (for example https://example.com), adjust options if needed, and choose Run Audit. The page POSTs to your API base (from NEXT_PUBLIC_API_BASE_URL or NEXT_PUBLIC_API_URL) at .../api/analyze/crawl, or to /api/analyze/crawl when no base is set (same-origin proxy).
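The endpoint resolution described above could be sketched as follows. resolveCrawlEndpoint is a hypothetical name, and the precedence between the two environment variables (NEXT_PUBLIC_API_BASE_URL first) is an assumption rather than confirmed behavior.

```typescript
// Environment-like lookup so the function can be tested without process.env.
type EnvLike = Record<string, string | undefined>;

// Assumption: NEXT_PUBLIC_API_BASE_URL wins over NEXT_PUBLIC_API_URL,
// and both already include the /api suffix.
function resolveCrawlEndpoint(env: EnvLike): string {
  const base = (env.NEXT_PUBLIC_API_BASE_URL || env.NEXT_PUBLIC_API_URL || "").trim();
  if (base === "") {
    // No base configured: fall back to the same-origin proxy path.
    return "/api/analyze/crawl";
  }
  return `${base.replace(/\/+$/, "")}/analyze/crawl`;
}

console.log(resolveCrawlEndpoint({}));
console.log(resolveCrawlEndpoint({ NEXT_PUBLIC_API_URL: "http://localhost:3000/api/" }));
```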
Successful runs render Summary, Robots & Sitemap, a per-URL table, broken-link samples, and query-parameter URLs. Use that layout as a checklist when building your own UI on top of the JSON report.
Request body fields
url: required starting origin; normalized server-side.
maxPages: max URLs processed (default 100 in server and UI).
depthLimit: max link hops from each seed (default 2).
concurrency: parallel in-flight requests, clamped roughly 1–20 on the server (default 12).
followExternal: follow off-domain links matching host rules when true (usually leave off unless you intentionally audit cross-domain properties).
rateLimitMs: optional minimum delay between outbound requests.
renderAllPages: when true, attempts Puppeteer-based rendering where available to capture JS-only links (much slower).
Built-in presets (UI)
The API wraps the payload as { success, report, version, timestamp }. The report object includes:
base_url, audit_date_utc, total_urls_discovered
sitemap_urls: URLs collected from robots and common sitemap locations.
robots: simplified allow, disallow, sitemaps extracted from robots.txt (missing file is non-fatal).
crawl_results_summary: final_score (0–100), crawlable_urls, broken_count, advisory sitemap_score and robots_score (each typically 5 or 10 in the heuristic).
broken_links_sample: up to hundreds of problematic URLs/strings.
per_url: map of URL → status, ok, canonical, metaRobots, links, out_broken_sample (broken internal sample per page), hasQuery, depth.
human_summary: multi-line text summary aligned with scores.
Persist or diff reports by audit_date_utc and canonical base_url when trending improvements across deploys.
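To make the report shape above easier to consume, here is a partial typing sketch with two small helpers. The field types are assumptions inferred from the field names (the server may emit different types), and failingUrls and reportKey are hypothetical helper names.

```typescript
// Partial, assumed typing of the audit report; only fields used below are declared.
interface PerUrlEntry {
  status: number;
  ok: boolean;
  hasQuery: boolean;
  depth: number;
}

interface CrawlReport {
  base_url: string;
  audit_date_utc: string;
  total_urls_discovered: number;
  crawl_results_summary: {
    final_score: number;
    crawlable_urls: number;
    broken_count: number;
  };
  per_url: Record<string, PerUrlEntry>;
}

// URLs whose fetch was not ok, sorted for stable diffs.
function failingUrls(report: CrawlReport): string[] {
  return Object.entries(report.per_url)
    .filter(([, entry]) => !entry.ok)
    .map(([url]) => url)
    .sort();
}

// Storage key combining base_url and audit_date_utc, as suggested for trending.
function reportKey(report: CrawlReport): string {
  return `${report.base_url}@${report.audit_date_utc}`;
}

// Illustrative data only; not real audit output.
const sampleReport: CrawlReport = {
  base_url: "https://example.com",
  audit_date_utc: "2026-05-09T12:00:00Z",
  total_urls_discovered: 3,
  crawl_results_summary: { final_score: 80, crawlable_urls: 2, broken_count: 1 },
  per_url: {
    "https://example.com/": { status: 200, ok: true, hasQuery: false, depth: 0 },
    "https://example.com/old": { status: 404, ok: false, hasQuery: false, depth: 1 },
    "https://example.com/docs": { status: 200, ok: true, hasQuery: false, depth: 1 },
  },
};

console.log(reportKey(sampleReport), failingUrls(sampleReport));
```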
The composite final_score (0–100) blends heuristic sitemap discovery, rudimentary robots presence signals, successful fetch ratio across crawled URLs, and a penalty scaled from sampled broken/internal errors. Exact weighting lives in crawlAuditService on the server—treat it as a directional health indicator, not a substitute for Google Search Console, log analysis, or vendor-specific crawl simulators.
Use sitemap_score / robots_score at face value only for quick regressions between deploys; they reward having discoverable sitemaps and a non-empty robots posture rather than deep semantic correctness of every directive.
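A quick regression check between two deploys might look like the sketch below. scoreRegressions is a hypothetical helper, and the tolerance parameter is an illustrative addition, not something the API provides.

```typescript
interface ScoreSummary {
  final_score: number;
  sitemap_score: number;
  robots_score: number;
}

// Returns a description for every score that dropped by more than `tolerance`.
function scoreRegressions(
  previous: ScoreSummary,
  current: ScoreSummary,
  tolerance = 0
): string[] {
  const keys: (keyof ScoreSummary)[] = ["final_score", "sitemap_score", "robots_score"];
  return keys
    .filter((key) => current[key] < previous[key] - tolerance)
    .map((key) => `${key}: ${previous[key]} -> ${current[key]}`);
}

// Illustrative numbers only.
const before: ScoreSummary = { final_score: 80, sitemap_score: 10, robots_score: 10 };
const after: ScoreSummary = { final_score: 72, sitemap_score: 10, robots_score: 5 };
console.log(scoreRegressions(before, after));
```

Because the scores are heuristic, a check like this is best used as a deploy-time alert that prompts a closer look rather than as a hard gate.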
POST /api/analyze/crawl
Content-Type: application/json
{
"url": "https://example.com",
"maxPages": 100,
"depthLimit": 2,
"concurrency": 12,
"followExternal": false,
"rateLimitMs": 0,
"renderAllPages": false
}
200 OK → { success: true, report, version, timestamp }. 400 → missing or invalid URL. 500 → audit failure (error / details). See also API Reference: Endpoints.
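The defaults and status codes above can be mirrored on the client, as in this sketch. withAuditDefaults and describeAuditStatus are hypothetical names; the 1–20 clamp only approximates the server-side behavior described earlier, and the rateLimitMs default of 0 is taken from the example request.

```typescript
interface CrawlAuditOptions {
  url: string;
  maxPages?: number;
  depthLimit?: number;
  concurrency?: number;
  followExternal?: boolean;
  rateLimitMs?: number;
  renderAllPages?: boolean;
}

function clamp(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Fill in the documented defaults before sending the request.
function withAuditDefaults(options: CrawlAuditOptions): Required<CrawlAuditOptions> {
  return {
    url: options.url,
    maxPages: options.maxPages ?? 100,
    depthLimit: options.depthLimit ?? 2,
    concurrency: clamp(Math.trunc(options.concurrency ?? 12), 1, 20),
    followExternal: options.followExternal ?? false,
    rateLimitMs: options.rateLimitMs ?? 0,
    renderAllPages: options.renderAllPages ?? false,
  };
}

// Map the documented status codes to labels a UI could branch on.
function describeAuditStatus(status: number): string {
  if (status === 200) return "ok";
  if (status === 400) return "invalid-url";
  if (status === 500) return "audit-failed";
  return "unexpected";
}

const auditBody = withAuditDefaults({ url: "https://example.com", concurrency: 50 });
console.log(JSON.stringify(auditBody), describeAuditStatus(400));
```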
ok: false; verify with curl or uptime tools before rewriting rules.
renderAllPages requires puppeteer-core plus a runnable Chrome/Chromium; if unsupported, omit rendering and rely on static HTML extraction.
LLMS-Audit/1.0; allowlisting that user agent on WAFs may be required.
robots.txt lives at https://your-domain/robots.txt. Bots fetch it before broad crawling; malformed or contradictory rules waste crawl budget and confuse search and AI spiders alike.
Workflow: configure each User-Agent block, add Allow/Disallow paths, optionally set crawl delay on the wildcard agent, then declare one or more Sitemap URLs. Validate before deploy using POST /api/validate/robots.
After publishing, sanity-check crawl policy with an AI Crawlability Audit run against production.
Each logical block begins with User-agent: <token>. Typical tokens include * (default), vendor-specific spiders (for example Googlebot), and documented AI bots (GPTBot, Claude-Web, Bytespider; verify current official names in each vendor's crawler documentation).
Separate blocks allow different policies—for example disallowing admin routes for * while tightening training bots with another block scoped to documented paths only.
Rule precedence follows the robots exclusion protocol specific to each crawler; when in doubt, keep overlapping blocks simple and test with audits plus vendor webmaster tooling.
Disallow: with an empty path means "nothing disallowed" under many parsers; the generator emits that pattern explicitly when lists are empty.
Allow: refines exclusions (useful under Disallow-heavy trees).
The REST generator accepts per-agent path arrays in userAgents[].disallow and userAgents[].allow, plus arbitrary lines in additionalRules for directives not modeled in the schema.
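To illustrate how per-agent arrays translate into file content, here is a minimal renderer. It is a sketch under stated assumptions, not the server's generator: renderRobotsTxt is a hypothetical name, and directive ordering within a block is an illustrative choice.

```typescript
interface AgentBlock {
  name: string;
  disallow: string[];
  allow: string[];
}

// Render one block per user agent, then any Sitemap lines.
function renderRobotsTxt(blocks: AgentBlock[], sitemapUrls: string[] = []): string {
  const lines: string[] = [];
  for (const block of blocks) {
    lines.push(`User-agent: ${block.name}`);
    if (block.disallow.length === 0) {
      // Empty Disallow: explicitly states that nothing is disallowed.
      lines.push("Disallow:");
    }
    for (const path of block.disallow) lines.push(`Disallow: ${path}`);
    for (const path of block.allow) lines.push(`Allow: ${path}`);
    lines.push(""); // blank line separates blocks
  }
  for (const sitemapUrl of sitemapUrls) lines.push(`Sitemap: ${sitemapUrl}`);
  return lines.join("\n").trimEnd() + "\n";
}

const robotsTxt = renderRobotsTxt(
  [
    { name: "*", disallow: [], allow: [] },
    { name: "GPTBot", disallow: ["/private/"], allow: ["/private/press/"] },
  ],
  ["https://example.com/sitemap.xml"]
);
console.log(robotsTxt);
```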
AI-facing crawlers reuse the robots protocol but may advertise distinct user-agents or honor additional signals (RSS, feeds, contractual terms beyond robots). Start from vendor guidance, then:
Generated snippets include commented examples for GPTBot-style agents to speed up authoring—uncomment and tailor them before production.
Declaring Sitemap: <absolute-url> in robots.txt aids discovery even though it is technically optional once search consoles know your URLs. Prefer HTTPS sitemap URLs, reference sitemap index files when you shard, and keep counts within search engine limits (typically 50,000 URLs per file plus an uncompressed size cap; see Large Sitemaps).
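The sharding advice above can be sketched as a chunking step plus a sitemap index. The shard file naming (sitemap-1.xml and so on) and the helper names are hypothetical; only the 50,000-URL limit comes from the text.

```typescript
const MAX_URLS_PER_SITEMAP = 50000;

// Split a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build a sitemap index that references each shard URL.
function renderSitemapIndex(shardUrls: string[]): string {
  const entries = shardUrls.map((url) => `  <sitemap><loc>${escapeXml(url)}</loc></sitemap>`);
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</sitemapindex>",
  ].join("\n");
}

// Illustrative input: 120,001 page URLs split into shards.
const pageUrls = Array.from({ length: 120001 }, (_, i) => `https://example.com/p/${i}`);
const shards = chunk(pageUrls, MAX_URLS_PER_SITEMAP);
const shardUrls = shards.map((_, i) => `https://example.com/sitemap-${i + 1}.xml`);
const indexXml = renderSitemapIndex(shardUrls);
console.log(shards.length, indexXml.split("\n").length);
```

The index URL is then the one to declare in the Sitemap: line of robots.txt.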
The generator returns content plus optional existing (live fetch) and warnings highlighting divergences from your current robots.txt preview.
Call POST /api/generate/sitemap with a validated site URL seed. The crawler walks internal links, respecting robots when respectRobots: true (default), builds page records from HTML responses, and emits URL sets (optionally chunked with an index when volume demands).
maxPages caps discovered URLs (MAX_PAGES_PER_SITEMAP env upper bound applies server-wide).
maxDepth trims exploration depth.
filterOptions forwards fine-grained crawler filters (paths, link patterns).
verbose toggles crawler diagnostics.
The HTTP response carries jobId, immediate sitemap XML, stats (counts, stopped-early indicator, byte size estimate), and a statusUrl for polling with GET /api/sitemap/status/:jobId.
POST /api/generate/sitemap
Content-Type: application/json
{
"url": "https://example.com/",
"maxPages": 50000,
"maxDepth": 8,
"respectRobots": true,
"filterOptions": {},
"verbose": false
}
In advanced mode the builder may emit <priority> hints. Numeric priority is advisory only; search engines approximate importance from linkage, freshness, and query demand. Prefer consistent canonical tagging in HTML (rel=canonical) over aggressive priority tweaking.
Use priority to elevate templates that materially affect navigation (homepage, cornerstone guides), not transactional noise.
changefreq is another advisory hint (never, yearly, weekly, daily…). It does not obligate bots to revisit on that cadence—actual schedules derive from crawl budget and observed change rates.
Set realistic coarse values aligned with editorial cadence. Pair with accurate lastmod timestamps when feasible; inflated frequencies erode trust signals across engines.
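For reference, a single sitemap entry carrying these advisory hints could be rendered as in the sketch below. renderUrlEntry is a hypothetical helper and not the builder's actual code; the changefreq values and the 0.0–1.0 priority range follow the sitemaps.org protocol.

```typescript
type ChangeFreq = "always" | "hourly" | "daily" | "weekly" | "monthly" | "yearly" | "never";

interface UrlEntry {
  loc: string;
  lastmod?: Date;
  changefreq?: ChangeFreq;
  priority?: number;
}

function escapeXmlText(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Render one <url> element; optional hints are emitted only when provided.
function renderUrlEntry(entry: UrlEntry): string {
  const parts: string[] = [`<loc>${escapeXmlText(entry.loc)}</loc>`];
  if (entry.lastmod) {
    // Date-only W3C format (YYYY-MM-DD), taken from the UTC timestamp.
    parts.push(`<lastmod>${entry.lastmod.toISOString().slice(0, 10)}</lastmod>`);
  }
  if (entry.changefreq) {
    parts.push(`<changefreq>${entry.changefreq}</changefreq>`);
  }
  if (entry.priority !== undefined) {
    // Clamp into the valid 0.0–1.0 range.
    const priority = Math.min(1, Math.max(0, entry.priority));
    parts.push(`<priority>${priority.toFixed(1)}</priority>`);
  }
  return `<url>${parts.join("")}</url>`;
}

const urlXml = renderUrlEntry({
  loc: "https://example.com/?a=1&b=2",
  lastmod: new Date("2026-05-09T00:00:00Z"),
  changefreq: "weekly",
  priority: 1.4,
});
console.log(urlXml);
```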
sitemap-index that references shard URLs; list the index URL in robots.txt.
stoppedEarly in API stats: if true, widen limits or deepen crawl thoughtfully.
Standard URL sitemaps help discovery for HTML pages; media-rich catalogs often benefit from extension namespaces documenting videos and images (xmlns:video, xmlns:image) with durations, thumbnails, geo, and captions.
The built-in crawler targets HTML link graphs first. Treat dedicated media sitemap entries as authored XML you maintain alongside programmatic page sitemaps: export metadata from CMS or CDN APIs, attach stable media URLs only, validate against Google/Microsoft schema examples, then reference those files from your index alongside standard URL sets.
For large libraries, segregate shards by locale or CDN partition to localize invalidation workflows.
LLMS-oriented text surfaces how you intend AI systems—including training crawlers—to handle your web properties: permitted uses, attribution, contact for licensing, feeds to prioritize, etc. Policies may evolve independently of robots.txt exclusions; robots governs crawling mechanics while LLMS prose documents business rules.
This product generates a pragmatic starting document from crawling and signal extraction; lawyers and policy owners must review wording before relying on it in contracts or compliance attestations.
There is no single ratified RFC—treat emerging community guidance alongside your governance team. Practically:
/llms.txt with UTF-8 encoding and deterministic caching headers.
POST /api/generate/llms
Content-Type: application/json
{
"url": "https://example.com",
"maxPages": 50,
"allowAITraining": true,
"requireAttribution": true
}
Response returns jobId; poll GET /api/llms/status/:jobId until completed to read content, aiReadinessScore, and analysis metadata (titles, robots/sitemap hints, structured data presence).
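The polling flow above might be sketched as follows. The status payload shape (a status field with the values "completed" and "failed") is an assumption, since the exact schema is not documented here; pollUntilCompleted and the injected StatusFetcher are hypothetical so the loop can run without a live server.

```typescript
// Assumed status payload; only `status` drives the loop.
interface JobStatus {
  status: string;
  content?: string;
  aiReadinessScore?: number;
}

// Injected fetcher, for example a wrapper around GET /api/llms/status/:jobId.
type StatusFetcher = (jobId: string) => Promise<JobStatus>;

async function pollUntilCompleted(
  jobId: string,
  fetchStatus: StatusFetcher,
  options: { intervalMs?: number; maxAttempts?: number } = {}
): Promise<JobStatus> {
  const intervalMs = options.intervalMs ?? 2000;
  const maxAttempts = options.maxAttempts ?? 30;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await fetchStatus(jobId);
    if (job.status === "completed") return job;
    if (job.status === "failed") throw new Error(`Job ${jobId} failed`);
    if (attempt < maxAttempts) {
      // Wait before the next poll.
      await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`Job ${jobId} did not complete after ${maxAttempts} attempts`);
}

// Fake fetcher for illustration: reports completion on the third call.
let statusCalls = 0;
const fakeFetcher: StatusFetcher = async () => {
  statusCalls += 1;
  return statusCalls < 3
    ? { status: "processing" }
    : { status: "completed", content: "# llms", aiReadinessScore: 70 };
};

const pollResult = pollUntilCompleted("job-1", fakeFetcher, { intervalMs: 1 });
pollResult.then((job) => console.log(job.status, statusCalls));
```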
POST /api/enhance/llms exposes a forwards-compatible enrichment hook accepting content and enhancementType; integrations may augment text with LLM rewriting when enabled server-side.
Prefer top-down narration: identities (who operates the site), scope (sites covered), crawler posture, licensing, attribution clauses, escalation contacts, changelog. Optionally cross-link FAQs and DMCA processes to avoid duplicating long legal prose inside llms-only files.
Use bullet lists sparingly—they parse well via screen readers and ingestion stacks compared to prose walls.
Skeleton you might adapt after reviewing compliance—placeholders annotated; replace brackets before publishing.
# LLMS Disclosure — ExampleCo — 2026-05-09
Organization: ExampleCo
Site: https://example.com
Training usage: Conditional — generative summaries allowed with attribution.
Citation: Visible attribution linking to canonical URLs required.
Disallowed uses: Competitive model training on paywalled content.
Contact: trust@example.com
Changelog:
- 2026-05-09 Initial publication
Robots
POST /api/generate/robots
POST /api/validate/robots
Sitemaps
POST /api/generate/sitemap
GET /api/sitemap/status/:jobId
/api/saas/sitemap/* (queue-oriented flows)
/api/seo-engine/* enhanced engine routes
/api/sitemap/admin/* administrative operations
LLMS
POST /api/generate/llms
GET /api/llms/status/:jobId
POST /api/enhance/llms
Analyze & audits
POST /api/analyze/sitemap
GET /api/analyze/sitemap/status/:jobId
GET /api/analyze/sitemap/report/:jobId
GET /api/analyze/sitemap/xml/:jobId
GET /api/analyze/sitemap/json/:jobId
POST /api/analyze/classify
POST /api/analyze/classify-bulk
GET /api/analyze/stats/:jobId
GET /api/analyze/jobs
DELETE /api/analyze/jobs/:jobId
POST /api/analyze/crawl (crawlability audit)
Health
GET /health
application/json).
success: true wrappers where applicable.
jobId; poll companion status routes.
Global Express rate limiting applies under /api/* with a configurable window (RATE_LIMIT_MAX caps total hits per rolling interval). Requests hitting certain status or enhancement endpoints may be exempt; see server configuration for authoritative skip rules.
Design clients with exponential backoff, especially for analyzer jobs queued server-side.
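One common way to compute retry delays is capped exponential backoff with jitter, sketched below. The base delay, cap, and jitter strategy are illustrative assumptions, not values taken from the server configuration.

```typescript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30000): number {
  const exponent = Math.max(0, Math.trunc(attempt));
  return Math.min(capMs, baseMs * 2 ** exponent);
}

// "Full jitter": pick a random delay in [0, delay) to avoid synchronized retries.
// The random source is injectable so the function stays deterministic in tests.
function withJitter(delayMs: number, random: () => number = Math.random): number {
  return Math.floor(random() * delayMs);
}

const schedule = [0, 1, 2, 3, 4, 5, 6, 7].map((attempt) => backoffDelayMs(attempt));
console.log(schedule);
```

A retry loop would sleep for withJitter(backoffDelayMs(attempt)) milliseconds after each rate-limited or failed request and stop after a fixed number of attempts.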
error / details strings suitable for structured logging.
Axios-based clients bubble errors through interceptors configured in app/lib/api.ts; map status codes centrally to telemetry and user-visible retry affordances.
The web repo ships Axios helpers exporting endpoints.robots, endpoints.sitemap, and endpoints.llms; add matching helpers for analyze routes as needed while keeping base URL normalization consistent.
For other ecosystems, scaffold thin SDKs atop OpenAPI-derived clients once you stabilize schemas—prioritize retries, timeouts, typed error unions, and stream-friendly handling for analyzer downloads.
import { endpoints } from '@/lib/api'
const { data } = await endpoints.robots.generate({
url: 'https://example.com',
userAgents: [{ name: '*', disallow: ['/private/'], allow: [] }],
})
console.log(data.content)
Reach out through support channels or browse common questions. We iterate documentation alongside API changes.